What will be the number of tokens (compiler)? - c

What will be the number of tokens in the following?
int a[2][3];
I think the tokens are -> {'int', '[', ']', '[', ']', ';'}
Can someone explain what to consider and what not when the compiler calculates tokens?
Thanks

Expanding on my comment:
How the input is tokenized is a function of your tokenizer (scanner). In principle, the input you presented might be tokenized as "int", "a", "[2]", "[3]", ";", for example. In practice, the most likely choice of tokenization would be "int", "a", "[", "2", "]", "[", "3", "]", ";". I am uncertain why you seem to think that the variable name and dimension values would not be represented among the tokens -- they carry semantic information and therefore must not be left out.
Although separating compiling into a lexical analysis step and a semantic analysis step is common and widely considered useful, it is not inherently essential to make such a separation at all. Where it is made, the choice of tokenization is up to the compiler. One ordinarily chooses tokens so that each represents a semantically significant unit, but there is more than one way to do that. For instance, my alternative example corresponds to a token sequence that might be characterized as
IDENTIFIER, IDENTIFIER, DIMENSION, DIMENSION, TERMINATOR
The more likely approach might be characterized as
IDENTIFIER, IDENTIFIER, OPEN_BRACKET, INTEGER, CLOSE_BRACKET, OPEN_BRACKET,
INTEGER, CLOSE_BRACKET, TERMINATOR
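To make that concrete, here is a minimal sketch (hypothetical type and token names, not any particular compiler's internals) of how a scanner following that characterization might represent the declaration int a[2][3]; as data:
#include <stdio.h>

typedef enum {
    TOK_IDENTIFIER, TOK_INTEGER, TOK_OPEN_BRACKET,
    TOK_CLOSE_BRACKET, TOK_TERMINATOR
} TokenKind;

typedef struct {
    TokenKind kind;
    const char *lexeme;   /* the characters the token covers */
} Token;

/* Nine tokens: the name and the dimension values are kept because they
   carry semantic information that later phases need. */
static const Token tokens[] = {
    { TOK_IDENTIFIER,    "int" },   /* or a dedicated KEYWORD kind */
    { TOK_IDENTIFIER,    "a"   },
    { TOK_OPEN_BRACKET,  "["   },
    { TOK_INTEGER,       "2"   },
    { TOK_CLOSE_BRACKET, "]"   },
    { TOK_OPEN_BRACKET,  "["   },
    { TOK_INTEGER,       "3"   },
    { TOK_CLOSE_BRACKET, "]"   },
    { TOK_TERMINATOR,    ";"   },
};

int main(void) {
    printf("%zu tokens\n", sizeof tokens / sizeof tokens[0]);   /* prints: 9 tokens */
    return 0;
}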
The questions to consider include
What units of the source contain meaningful semantic information in their own right? For instance, it is not useful to make each character a separate token or to split up int into two tokens, because such tokens do not represent a complete semantic unit.
How much responsibility you can or should put on the lexical analyzer (for instance, to understand the context enough to present DIMENSION instead of OPEN_BRACKET, INTEGER, CLOSE_BRACKET)
Updated to add:
The C standard does define the post-preprocessing language in terms of a specific tokenization, which for the statement you gave would be the "most likely" alternative I specified (and that's one reason why it's the most likely). I have answered the question in a more general sense, however, in part because it is tagged [compiler-construction].

Related

Why does C use two single quotes to delimit char literals instead of just one?

Does C really need two single quotes (apostrophes) to delimit char literals instead of just one?
For string literals we do need to delimit the start and the end since strings vary in length, but it seems to me that we do know how long a char literal will be: either a single character (in the source), two characters if it is a regular character escape (prefix \0), five characters if it is an octal literal (prefix \0[0-7]), etc.
Keep in mind that I am looking for a technical answer, not a historical one. Does it make parsing simpler? Did it make parsing simpler on 70s hardware? Does it allow for better parsing error messages? Things like that.
(The same question could be asked for most C syntax inspired languages since most of them seem to use the same syntax to delimit char literals. I think the Jai programming language might be an exception since I seem to recall that it just uses a single question mark (at the beginning), but I’m not certain.)
Some examples:
'G'
'\0'
'\0723'
Would it work if we just used a single quote at the start of the token?
'G
'\0
'\0723
Could we in principle parse these tokens the same way without complicating the grammar?
We see that the null byte literal and the octal literal have the same prefix, but there might not be any ambiguity, since there might be no way (at least to my mind) that '\0 followed immediately by 723 could be anything other than a char literal. And if there is an ambiguity, then the null byte literal could become \z instead.
Are the two single quotes needed in order to properly parse char literals?
cppreference.com says that multicharacter constants were inherited by C from the B programming language, so they have probably existed from the start. Since they can be of various widths, the ending quote is pretty much a requirement.
Apart from that and aesthetics in general, a character constant representing the space character in particular would look somewhat awkward and be a likely magnet for mistakes if it were just ' instead of ' '.
One answer (there might be more) might be that C99 supports multicharacter literals. See for example this SO question.
So for example 'left' is a valid (multi) char literal.
Once you have multichar literals you might not be able to just use a single quotation mark to delimit char literals. For example, how would you delimit the literal 'a c' with just one single quotation mark?
The meaning of such literals is implementation defined so I don’t know how widely-supported this feature is.
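For illustration, a minimal example (multicharacter constants have type int, most compilers accept them with a warning, and the value is implementation-defined):
#include <stdio.h>

int main(void) {
    int m = 'left';   /* multicharacter constant: implementation-defined int value */
    int c = 'a';      /* ordinary character constant, for comparison */
    printf("%d %d\n", m, c);
    return 0;
}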
Why does C use two single quotes to delimit char literals instead of just one?
Because several historical predecessors of C (e.g. PL/I, B, and some dialects of Fortran or ALGOL) did so.
And because the C standard (e.g. n1570 or something newer) specifies that.
And perhaps because in the 1970s it was faster to parse (for most char literals like 'z' ....)

Restrictions to Unicode escape sequences in C11

Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
static inline int test_unicode_single() {
    return strlen(u8"\u000A") > 1;
}
While I understand that it's not exactly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?
It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
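A quick way to see this equivalence (a minimal sketch assuming a UTF-8 source file and a C11 compiler that supports u8 string literals):
#include <assert.h>
#include <string.h>

int main(void) {
    /* The directly entered ñ and the UCN \u00F1 should end up as the same
       byte sequence (0xC3 0xB1) after the early translation phases. */
    const char *ucn = (const char *)u8"\u00F1";
    const char *raw = (const char *)u8"ñ";
    assert(strcmp(ucn, raw) == 0);
    return 0;
}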
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the string literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, and nor does it contain the ASCII characters $, # and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with in Phase 1, prior to lexical analysis, and Phase 1 only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.
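A quick way to verify it (minimal standalone snippet; note the literal line break inside the string):
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "\\
n";
    /* Phase 1 splices the backslash-newline, so s is "\n": one character. */
    printf("%zu\n", strlen(s));   /* prints 1 */
    return 0;
}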

I need help filtering bad words in C?

I am trying to filter various bad words; below is the code I have so far. I am using C, and this is for a GTK application.
char LowerEnteredUsername[EnteredUsernameLen];
for (unsigned int i = 0; i < EnteredUsernameLen; i++) {
    LowerEnteredUsername[i] = tolower(EnteredUsername[i]);
}
LowerEnteredUsername[EnteredUsernameLen+1] = '\0';
if (strstr(LowerEnteredUsername, (char[]){LetterF, LetterU, LetterC, LetterK}) ||
    strstr(LowerEnteredUsername, (char[]){LetterF, LetterC, LetterU, LetterK})) {
    gtk_message_dialog_set_markup((GtkMessageDialog*)Dialog, "This username seems to be innapropriate.");
    UsernameErr = 1;
}
My issue is that it will only filter the last bad word specified in the if statement, in this example "fcuk". If I input "fuck", the code passes it as clean. How can I fix this?
(char[]){LetterF, LetterU, LetterC, LetterK}
(char[]){LetterF, LetterC, LetterU, LetterK}
You’ve forgotten to terminate your strings with a '\0'. This approach doesn’t seem to me to be very effective in keeping "bad words" out of the source code, so I’d really suggest just writing regular string literals:
if (strstr(LowerEnteredUsername, "fuck") || strstr(LowerEnteredUsername, "fcuk")) {
Much clearer. If this is really, truly a no-go, then some other indirect but less error-prone ways are:
"f" "u" "c" "k"
or
#define LOWER_F "f"
#define LOWER_U "u"
#define LOWER_C "c"
#define LOWER_K "k"
and
LOWER_F LOWER_U LOWER_C LOWER_K
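For what it's worth, a minimal standalone check showing why these indirect spellings work: adjacent string literals are concatenated during translation, so they produce exactly the same literal as writing the word directly.
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *word = "f" "u" "c" "k";    /* concatenated into a single literal */
    printf("%d\n", strcmp(word, "fuck"));  /* prints 0 */
    return 0;
}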
Doing human-language text processing in C is painful because C's concept of strings (i.e. char*/char[] and wchar_t*/wchar_t[]) is very low-level and not expressive enough to easily represent Unicode text, let alone locate word boundaries in text and match words in a known dictionary (also consider things like inflection, declension, plurals, and the use of diacritics to evade naive string matching).
For example, your program would need to handle George Carlin's famous Seven Dirty Words quote:
https://www.youtube.com/watch?v=vbZhpf3sQxQ
Someone was quite interested in these words. They kept referring to them: they called them bad, dirty, filthy, foul, vile, vulgar, coarse, in poor taste, unseemly, street talk, gutter talk, locker room language, barracks talk, bawdy, naughty, saucy, raunchy, rude, crude, lude, lascivious, indecent, profane, obscene, blue, off-color, risqué, suggestive, cursing, cussing, swearing... and all I could think of was: shit, piss, fuck, cunt, cocksucker, motherfucker, and tits!
This could be slightly modified to evade a naive filter, like so:
Someone was quite interested in these words. They kept referring to them: they called them bad, dirty, filthy, foul, vile, vulgar, coarse, in poor taste, unseemly, street talk, gutter talk, locker room language, barracks talk, bawdy, naughty, saucy, raunchy, rude, crude, lude, lascivious, indecent, profane, obscene, blue, off-color, risqué, suggestive, cursing, cussing, swearing... and all I could think of was: shít, pis$, phuck, c​unt, сocksucking, motherfúcker, and títs!
Above, some of the words have simple character replacements (like s to $), others have diacritics added (like u to ú), and some are simply respelled so they sound the same. However, some of the other words above look identical but actually contain homographs or "invisible" characters like Unicode's zero-width space, so they would evade naive text-matching systems.
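A minimal illustration of that last point (assumes a C11 compiler with u8 string literals): the zero-width space is invisible when rendered but still occupies bytes in the string, so a naive substring search misses the word.
#include <stdio.h>
#include <string.h>

int main(void) {
    /* \u200B is a zero-width space: three UTF-8 bytes between 'f' and 'u'. */
    const char *evasive = (const char *)u8"f\u200Buck";
    printf("%s\n", strstr(evasive, "fuck") ? "caught" : "missed");   /* prints: missed */
    return 0;
}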
So in short: avoid doing this in C. If you must, then use a robust and fully-featured Unicode handling library (i.e. do not use the C standard library's string functions like strstr, strtok, strlen, etc.).
Here's how I would do it:
Read the input into a binary blob containing Unicode text (presumably UTF-8).
Use a Unicode library to:
Normalize the encoded Unicode text data (see https://en.wikipedia.org/wiki/Unicode_equivalence )
Identify word boundaries (assuming we're dealing with European-style languages that use sentences comprised of words).
Use a linguistics library and database (English alone is full of special-cases) to normalize each word to some singular canonical form.
Then look up each morpheme in a case-insensitive hash set of known "bad words".
Now, there are a few shortcuts you can take:
You can use regular-expressions to identify word-boundaries.
There exist Unicode-aware regular-expression libraries for C, for example PCRE2: http://www.pcre.org/current/doc/html/pcre2unicode.html
You can skip normalizing each word's inflections/declensions if you're happy with having to list those in your "bad word" list.
I would write working code for this example, but I'm short on time tonight (and it would be a LOT of code); hopefully this answer provides you with enough information to figure out the rest yourself.
(Pro-tip: don't match strings in a list by checking each character - it's slow and inefficient. This is what hashtables and hashsets are for!)
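As an illustration of the PCRE2 shortcut mentioned above, a minimal sketch might look like the following (not production code; it assumes the 8-bit PCRE2 library is installed, UTF-8 input, and a hypothetical two-word pattern):
/* Build with something like: cc filter.c $(pcre2-config --libs8) */
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

/* Returns 1 if text contains a banned word, 0 if not, -1 on error. */
int contains_banned_word(const char *text) {
    int errcode;
    PCRE2_SIZE erroffset;
    /* PCRE2_UTF + PCRE2_UCP make \b and \w Unicode-aware; PCRE2_CASELESS
       removes the need to lowercase the input by hand. */
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"\\b(fuck|fcuk)\\b",
                                   PCRE2_ZERO_TERMINATED,
                                   PCRE2_UTF | PCRE2_UCP | PCRE2_CASELESS,
                                   &errcode, &erroffset, NULL);
    if (re == NULL)
        return -1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    if (md == NULL) {
        pcre2_code_free(re);
        return -1;
    }

    int rc = pcre2_match(re, (PCRE2_SPTR)text, PCRE2_ZERO_TERMINATED,
                         0, 0, md, NULL);

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return rc >= 0;   /* negative rc includes PCRE2_ERROR_NOMATCH */
}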

Difficulty with my lexical analyzer

I'm trying to program a lexical analyzer for a standard C translation unit, so I've divided the possible tokens into 6 groups; for each group there's a regular expression, which will be converted to a DFA:
Keyword - (will have a symbol table containing "goto", "int"....)
Identifiers - [a-zA-Z][a-zA-Z0-9]*
Numeric Constants - [0-9]+\.?[0-9]*
String Constants - ""[EVERY_ASCII_CHARACTER]*""
Special Symbols - (will have a symbol table containing ";", "(", "{"....)
Operators - (will have a symbol table containing "+", "-"....)
My Analyzer's input is a stream of bytes/ASCII characters. My algorithm is the following:
assuming there's a stream of characters, x1...xn
foreach i = 1; i <= n; i++
    if x1...xi is accepted by one or more of the 6 groups' DFAs
    {
        take the longest token
        add x1...xi to the token linked list
        delete x1...xi from the input
    }
However, this algorithm will treat every letter it is given as an identifier, since after an input of one character the string is already accepted by the identifier DFA ([a-zA-Z][a-zA-Z0-9]*).
Another possible problem is for the input "intx;", my algorithm will tokenize this stream into "int", "x", ";" which of course is an error.
I'm trying to think about a new algorithm, but I keep failing. Any suggestions?
Code your scanner so that it treats identifiers and keywords the same until the reading is finished.
When you have the complete token, look it up in the keyword table, and designate it a keyword if you find it and an identifier if you don't. This deals with the intx problem immediately; the scanner reads intx, and that's not a keyword, so it must be an identifier.
I note that your identifiers don't allow underscores. That's not necessarily a problem, but many languages do allow underscores in identifiers.
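A minimal sketch of that "scan first, classify afterwards" idea (hypothetical helper and keyword table, not the asker's code; underscores are allowed here as well):
#include <ctype.h>
#include <string.h>

static const char *keywords[] = { "int", "goto", "if", "while", "return" };

typedef enum { TOK_KEYWORD, TOK_IDENTIFIER } TokenKind;

/* Assumes the caller has already seen a letter or '_' at *p.
   Reads one identifier-like token, then decides what it is. */
static TokenKind scan_word(const char **p, char *buf, size_t bufsize) {
    size_t n = 0;
    while (isalnum((unsigned char)**p) || **p == '_') {
        if (n + 1 < bufsize)
            buf[n++] = **p;
        (*p)++;
    }
    buf[n] = '\0';
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(buf, keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENTIFIER;   /* "intx" falls through to here */
}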
Tokenizers generally FIRST split the input stream into tokens, based on rules which dictate what constitute an END of token, and only later decide what kind of token it is (or an error otherwise). Typical end of token are things like white space (when not part of literal string), operators, special delimiters, etc.
It seems you are missing the greediness aspect of competing DFAs. Greedy matching is usually the most useful (left-most longest match) because it solves the problem of how to choose between competing DFAs. Once you've matched int you have another node in the IDENTIFIER DFA that advances to intx. Your finite automaton doesn't exit until it reaches something it can't consume, and if it isn't in a valid accept state at the end of input, or at the point where another DFA is accepting, it is pruned and the other DFA is matched.
Flex, for example, defaults to greedy matching.
In other words, your proposed problem of intx isn't a problem...
If you have 2 rules that compete for int
rule 1 is the token "int"
rule 2 is IDENTIFIER
When we reach
i n t
we don't immediately ACCEPT int because we see another rule (rule 2) where further input x advances the automaton to a NEXT state:
i n t x
If rule 2 is in an ACCEPT state at that point, then rule 1 is discarded by definition. But if rule 2 is still not in an ACCEPT state, we must keep rule 1 around while we examine more input to see whether we could eventually reach an ACCEPT state in rule 2 that is longer than rule 1. If we receive some other character that matches neither rule, we check whether the rule 2 automaton is in an ACCEPT state for intx; if so, that is the match. If not, it is discarded and the longest previous match (rule 1) is accepted. In this case, however, rule 2 is in an ACCEPT state and matches intx.
If 2 rules reach an ACCEPT or EXIT state simultaneously, precedence is used (the order of the rules in the grammar). Generally you put your keywords first so that IDENTIFIER doesn't match first.

Does the comma operator in an array have a name?

I was just wondering if any programming language, organization, or computer scientist had ever given a name to the comma operator or equivalent separator when used in an array?
["Do", "the", "commas", "here", "have", "a", "name"]?
e.g. separator, next, continue, etc.?
As per the comments: the comma in the example is not an operator. It is a list separator, and where it appears in a grammar it is typically referred to as a list separator, a separator, or just "comma".
It is used with this meaning and terminology in at least 90% of the languages I've used (which is quite a few). There are languages with different separators or additional separators, including white space or just about any punctuation character you can think of, but no original names for them as far as I recall.
I do not rule out the possibility that some creative person has called it something different. If not, feel free to be the first.
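For what it's worth, a minimal standalone example contrasting the two uses of the comma:
#include <stdio.h>

int main(void) {
    /* The commas in this braced initializer are list separators in the
       grammar, not the comma operator. */
    int a[] = { 1, 2, 3 };

    /* By contrast, this parenthesized comma is the comma operator:
       evaluate the left operand, discard its value, yield the right one. */
    int b = (puts("side effect"), 42);

    printf("%d %d\n", a[0], b);   /* prints: 1 42 */
    return 0;
}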

Resources