Does the comma operator in an array have a name? - arrays

I was just wondering if any programming language, organization, or computer scientist had ever given a name for the comma operator or equivalent separator when used in an array?
["Do", "the", "commas", "here", "have", "a", "name"]?
i.e. separators, next, continue, etc.?

As per the comments: the comma in the example is not an operator. It is a list separator, and where it appears in a grammar it is typically referred to as a list separator, a separator, or just a 'comma'.
It is used with this meaning and terminology in at least 90% of the languages I've used (which is quite a few). There are languages with different separators or additional separators, including white space or just about any punctuation character you can think of, but no original names for them as far as I recall.
I do not rule out the possibility that some creative person has called it something different. If not, feel free to be the first.
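To make the separator/operator distinction concrete, here is a small sketch in C (chosen only for illustration; the question is language-neutral, and the function names are made up for this example). In the initializer list the commas merely separate elements, whereas in the parenthesized expression the comma is a genuine operator that evaluates its left operand, discards the result, and yields the right operand.

```c
#include <stddef.h>

/* The commas in an initializer list are separators: they delimit elements. */
size_t separator_count_demo(void)
{
    int items[] = {10, 20, 30};              /* three elements, two separators */
    return sizeof items / sizeof items[0];   /* number of elements */
}

/* Here the comma is the comma operator: the left operand is evaluated first,
 * its value is discarded, and the expression yields the right operand. */
int operator_demo(void)
{
    int i = 0;
    int result = (i++, i + 10);              /* i becomes 1, then 1 + 10 */
    return result;
}
```

Calling `separator_count_demo()` gives 3 and `operator_demo()` gives 11, which shows that the same character plays two unrelated roles.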

Related

Splitting string in C by blank spaces, besides when said blank space is within a set of quotes

I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis to parse the Lisp. It describes two steps in tokenising a S-exp, those steps being:
Put spaces around every parenthesis
Split on white space
The first step is easy enough, I wrote a trivial function that replaces certain substrings with other substrings, but I'm having problems with the second step. In the article it only uses the string "Lisp" in its examples of S-exps; if I were to use strtok() to blindly split by whitespace, any string in a S-exp that had a space within it would become fragmented and interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.
How would I write a function that splits a string by white space, besides when the text is in between two double quotes?
I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project was to use only ANSI C, or, if need be, C99, solely for the sake of portability - fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my Lisp around a nightmare.
When researching this problem I came across this StackOverflow answer; but the approved answer only sends the tokenised string onto stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in memory data structures.
As well as this, the approved answer on the aforementioned SO question is written to be restricted to specifically my problem - ideally, I'd have myself a general purpose function that would allow me to tokenise a string, except when a substring is between two occurrences of some character x. This isn't a huge deal, it's just that I'd like my codebase to be clean and composable.
You have two delimiters: the space and double quotes.
You can use the strcspn function for that (see, for example, cppreference - strcspn).
Iterate over the string and look for the delimiters (space and quotes). strcspn returns the number of characters before the first delimiter it finds (or the length of the remaining string if there is none), so the character at that offset tells you which delimiter was hit. If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quotes) to "\"" (double quotes only). When you then hit the quotes again, change the delimiter set back to " \"" (space and quotes).
Based on your comment:
Let's say you have a string like
This is an Example.
The output would be
This
is
an
Example.
If the string looked like
This "is an" Example.
The output would be
This
is an
Example.
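The approach above could be sketched roughly as follows in C99, using only the standard library. The names split_quoted and free_tokens are made up for this sketch, and it makes some simplifying assumptions: only the space character counts as whitespace (tabs and newlines could be added to the outer delimiter set), the quote character is passed in and must not be '\0', an unterminated quote simply runs to the end of the string, and there is no escape mechanism for a quote inside a quoted run.

```c
#include <stdlib.h>
#include <string.h>

/* Free an array returned by split_quoted(). */
void free_tokens(char **tokens, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(tokens[i]);
    free(tokens);
}

/* Split input on spaces, except between two occurrences of quote.
 * Returns a malloc'd array of malloc'd strings and stores its length
 * in *count; returns NULL if an allocation fails. */
char **split_quoted(const char *input, char quote, size_t *count)
{
    const char outside[3] = { ' ', quote, '\0' };  /* delimiters outside quotes */
    const char inside[2]  = { quote, '\0' };       /* delimiter inside quotes */
    size_t len = strlen(input);
    /* Tokens are separated by at least one space, so len / 2 + 1 is enough. */
    char **tokens = malloc((len / 2 + 1) * sizeof *tokens);
    char *buf = malloc(len + 1);                   /* scratch for the current token */
    size_t n = 0, used = 0;
    int in_quotes = 0, have_token = 0;

    *count = 0;
    if (tokens == NULL || buf == NULL) {
        free(tokens);
        free(buf);
        return NULL;
    }

    for (const char *p = input;;) {
        /* Length of the run that contains none of the current delimiters. */
        size_t run = strcspn(p, in_quotes ? inside : outside);
        memcpy(buf + used, p, run);
        used += run;
        have_token |= (run > 0);
        p += run;

        if (*p != '\0' && *p == quote) {
            in_quotes = !in_quotes;   /* switch delimiter sets */
            have_token = 1;           /* so that "" yields an empty token */
            p++;
            continue;
        }
        /* *p is now a space outside quotes, or the end of the string. */
        if (have_token) {
            tokens[n] = malloc(used + 1);
            if (tokens[n] == NULL) {
                free_tokens(tokens, n);
                free(buf);
                return NULL;
            }
            memcpy(tokens[n], buf, used);
            tokens[n][used] = '\0';
            n++;
            used = 0;
            have_token = 0;
        }
        if (*p == '\0')
            break;
        p++;                          /* skip the space */
    }

    free(buf);
    *count = n;
    return tokens;
}
```

Called as split_quoted("This \"is an\" Example.", '"', &n), this yields the three tokens This, is an, and Example. with the quotes stripped; the caller releases the result with free_tokens.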

Why does C use two single quotes to delimit char literals instead of just one?

Does C really need two single quotes (apostrophes) to delimit char literals instead of just one?
For string literals we do need to delimit the start and the end since strings vary in length, but it seems to me that we do know how long a char literal will be: either a single character (in the source), two characters if it is a regular character escape (prefix \0), five characters if it is an octal literal (prefix \0[0-7]), etc.
Keep in mind that I am looking for a technical answer, not a historical one. Does it make parsing simpler? Did it make parsing simpler on 70s hardware? Does it allow for better parsing error messages? Things like that.
(The same question could be asked for most C syntax inspired languages since most of them seem to use the same syntax to delimit char literals. I think the Jai programming language might be an exception since I seem to recall that it just uses a single question mark (at the beginning), but I’m not certain.)
Some examples:
'G'
'\0'
'\0723'
Would it work if we just used a single quote at the start of the token?
'G
'\0
'\0723
Could we in principle parse these tokens the same way without complicating the grammar?
We see that the null byte literal and the octal literal have the same prefix, but there might not be any ambiguity since there might not be any way that '\0 followed immediately by 723 might be anything else than a char literal (at least to my mind). And if there is an ambiguity then the null byte literal could become \z instead.
Are the two single quotes needed in order to properly parse char literals?
cppreference.com says that multicharacter constants were inherited by C from the B programming language, so they have probably existed from the start. Since they can be of various widths, the ending quote is pretty much a requirement.
Apart from that and aesthetics in general, a character constant representing the space character in particular would look somewhat awkward and be a likely magnet for mistakes if it was just ' instead of ' '.
One answer (there might be more) might be that C99 supports multicharacter literals. See for example this SO question.
So for example 'left' is a valid (multi) char literal.
Once you have multichar literals you might not be able to use just a single quotation mark to delimit char literals. For example, how would you delimit the literal 'a c' with just one single quotation mark?
The meaning of such literals is implementation defined so I don’t know how widely-supported this feature is.
Why does C use two single quotes to delimit char literals instead of just one?
Because several historical predecessors of C (e.g. PL/1, and B and some dialects of Fortran or ALGOL) did so.
And because the C standard (e.g. n1570 or something newer) specifies that.
And perhaps because in the 1970s it was faster to parse (for most char literals, like 'z').

Use special characters other than "\n" and "\0" in C

I have one question.
I'm writing some code in C, on UNIX.
I need to write a special character in a file, because I need to divide my file in small sections.
Example:
'SPECIAL_CHARACTER'
section 1 with some text
'SPECIAL_CHARACTER'
section 2 with some text
etc..
I was thinking of using the character '\1'. It seems to work, but is it OK? Or is it wrong?
What should I do to achieve this without using characters like "\0" or "\n"?
I hear two different questions where you ask "Or is it wrong?"
I hear you asking "how can I designate a separator byte in my code?", and I hear you asking "what is a good choice for a separator byte?"
First, fundamentally, what you are asking about is covered in section 6.4.4.4 of the C language specification, which covers "C Character Constants". There are various places you can look up the formal C language spec, or you can search for "C Character Constants" for perhaps a friendlier description, etc.
In detail, a handful of letters can be used in escape sequences to stand in for single bytes of specific values; e.g., \n is one of those, as a stand-in for 0x0a (decimal 10), a byte designated (in ASCII) as a newline. Here are the legal ones:
\a \b \f \n \r \t \v
The escape sequences \0 and \1 work because C supports using \ followed by digits as an octal value. So, that'll also work with, say, \3 and \35, but not \9, and note that \35 has a decimal value of 29. (Google "octal values" if you don't immediately see why that's the case.)
There are other legal escape sequences:
\' \" \\ \? : ' " \ and ?, respectively
\xNNNN... : each 'N' can be a hexadecimal digit
And, of course, escape sequences are just one aspect of C character constants.
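As a quick check of the arithmetic, here is a small sketch (the enumerator and function names are made up for illustration). The octal and hexadecimal values follow directly from the digits; the value shown for \n assumes an ASCII execution character set.

```c
#include <stdio.h>

/* Each character constant below denotes a single byte value. */
enum {
    NEWLINE   = '\n',    /* 10 on ASCII systems */
    OCTAL_ONE = '\1',    /* octal 1 = decimal 1 */
    OCTAL_35  = '\35',   /* octal 35 = 3*8 + 5 = decimal 29 */
    HEX_1E    = '\x1e'   /* hex 1e = 1*16 + 14 = decimal 30 */
};

/* Print the numeric value behind each escape sequence. */
void print_escape_values(void)
{
    printf("\\n   = %d\n", NEWLINE);
    printf("\\1   = %d\n", OCTAL_ONE);
    printf("\\35  = %d\n", OCTAL_35);
    printf("\\x1e = %d\n", HEX_1E);
}
```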
Second, whether or not you should use a given byte value as your file's section separator depends entirely on how your program will be used. As others have pointed out in the comments, there are commonplace prevailing practices on what sort of byte value to use for this sort of thing.
I personally agree that 0x1e makes perhaps the most sense since in ASCII it is the "record separator". Conforming to ASCII can matter if the data will need to be understood by other programs, or if your program will need to be understood by other people.
On the other hand, a simple code comment can make it clear to anyone reading your code what byte value you are using for separating sections of your data file, and any program that needs to understand your data files needs to 'know' a lot more about the file format than just what the record separator is. There is nothing magical about 0x1e : it is merely a convention, and a reserved spot on the ASCII table to facilitate a common need -- that is, record separation of text that could contain normal text separators like space, newline, and null.
Broadly, any byte value that won't show up in the contents of your sections would make a fine section separator. Since you say those contents will be text, there are well over 100 choices, even if you exclude \0 (0x00) and \n (0x0a). In ASCII, a handful of byte values have been set aside for this sort of purpose, so that helps reduce the choice from several dozen to just several. Even among those several, there are only a few commonly used as separators.
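For illustration, reading such a file back could look roughly like the sketch below. It assumes the layout from the question (a separator byte in front of every section), that the whole file has already been read into a NUL-terminated buffer, and that 0x1e was chosen as the separator; SECTION_SEP and split_sections are names made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* ASCII "record separator"; any byte that never occurs in the text would do. */
#define SECTION_SEP '\x1e'

/* Split data in place: every separator is overwritten with '\0' and a pointer
 * to the text following it is stored in sections[]. Text before the first
 * separator is ignored. Returns the number of sections stored (at most max);
 * if there are more than max sections, the last stored one keeps the rest. */
size_t split_sections(char *data, char **sections, size_t max)
{
    size_t n = 0;
    char *p = strchr(data, SECTION_SEP);

    while (p != NULL && n < max) {
        *p++ = '\0';                   /* terminate the previous section */
        sections[n++] = p;             /* a section starts after the separator */
        p = strchr(p, SECTION_SEP);
    }
    return n;
}
```

When writing the file, the same constant can be emitted with fputc(SECTION_SEP, fp) before each section.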

What will be number of tokens(compiler)?

What will be the number of tokens in the following?
int a[2][3];
I think tokens are -> {'int', '[', ']', '[', ']', ';'}
Can someone explain what to consider, and what not to, when a compiler counts tokens?
Thanks
Expanding on my comment:
How the input is tokenized is a function of your tokenizer (scanner). In principle, the input you presented might be tokenized as "int", "a", "[2]", "[3]", ";", for example. In practice, the most likely choice of tokenization would be "int", "a", "[", "2", "]", "[", "3", "]", ";". I am uncertain why you seem to think that the variable name and dimension values would not be represented among the tokens -- they carry semantic information and therefore must not be left out.
Although separating compiling into a lexical analysis step and a semantic analysis step is common and widely considered useful, it is not inherently essential to make such a separation at all. Where it is made, the choice of tokenization is up to the compiler. One ordinarily chooses tokens so that each represents a semantically significant unit, but there is more than one way to do that. For instance, my alternative example corresponds to a token sequence that might be characterized as
IDENTIFIER, IDENTIFIER, DIMENSION, DIMENSION, TERMINATOR
The more likely approach might be characterized as
IDENTIFIER, IDENTIFIER, OPEN_BRACKET, INTEGER, CLOSE_BRACKET, OPEN_BRACKET,
INTEGER, CLOSE_BRACKET, TERMINATOR
The questions to consider include
What units of the source contain meaningful semantic information in their own right? For instance, it is not useful to make each character a separate token or to split up int into two tokens, because such tokens do not represent a complete semantic unit.
How much responsibility you can or should put on the lexical analyzer (for instance, to understand the context enough to present DIMENSION instead of OPEN_BRACKET, INTEGER, CLOSE_BRACKET)
Updated to add:
The C standard does define the post-preprocessing language in terms of a specific tokenization, which for the statement you gave would be the "most likely" alternative I specified (and that's one reason why it's the most likely). I have answered the question in a more general sense, however, in part because it is tagged [compiler-construction].
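To illustrate the "most likely" tokenization, here is a minimal scanner sketch. It is not a full C lexer: it only recognizes identifiers/keywords, unsigned decimal integers, and single-character punctuators, and count_tokens is a name made up for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the tokens in a tiny C-like subset: identifiers or keywords,
 * decimal integer literals, and one-character punctuators. */
size_t count_tokens(const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t count = 0;

    while (*p != '\0') {
        if (isspace(*p)) {             /* whitespace separates tokens */
            p++;
            continue;
        }
        if (isalpha(*p) || *p == '_') {
            while (isalnum(*p) || *p == '_')
                p++;                   /* identifier or keyword, e.g. int, a */
        } else if (isdigit(*p)) {
            while (isdigit(*p))
                p++;                   /* integer literal, e.g. 2, 3 */
        } else {
            p++;                       /* punctuator such as [ ] ; */
        }
        count++;
    }
    return count;
}
```

For the input "int a[2][3];" this counts nine tokens: int, a, [, 2, ], [, 3, ], and ;.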

What is this character ` called?

I never noticed the character ` (the one in the same key as tilde ~). There is another single quote character ' in the same key as ". I see that the characters ` and ' aren't interchangeable whereas ' and " are.
I lost a lot of time to this when compiling GTK programs. The build gave an error (file not found), and I finally figured out that the character is not a single quote.
What is the purpose of this ` character and when is it (or when should it) be used?
Thanks.
It's typically called a "backtick", and in bash, it is used for command substitution (although the $(cmd) construct is usually preferred due to easier nesting).
` is variously known as a backtick or a grave accent (see http://en.wikipedia.org/wiki/Grave_accent).
In UNIX shells, as well as some scripting languages (Ruby, Perl...), it introduces some input to be executed in a subshell. In C and C++, it has no special purpose but can appear as a character literal or as part of a string literal. One reason it's not used for something more interesting is that the extremely wide portability of these languages spans machines where the character can't be expected to be on the keyboard, and where it may not display very differently from the single right quote "'" on screen and in printouts, making for extremely hard-to-see bugs.
In some word-processing and similar application programs, typing a backtick will insert a single left quote character "‘". Commonly, keyboard input software will also let a user combine a backtick with a letter, say "e", to enter the character "è", or with "a" for "à", as used in some languages' alphabets.
I call it a "grave", as in a grave accent
In MySQL, it's used to surround identifiers when they might otherwise be ambiguous (such as using a reserved word as a table or column name). There are going to be lots of different uses of that character in lots of different pieces of software, just as there are for the other keys on the keyboard.
In a few languages, including PHP, Perl, and I think Ruby, backticks execute shell commands.
http://php.net/manual/en/language.operators.execution.php
The SQL use mentioned above is another one, which unfortunately I am well aware of because of co-workers who decided 'Desc' was a good name for a field.
