I'm looking for how can I write identifiers name with characters like [ ' " or #.
Everytime that I try to do that, I give the error:
error: macro names must be identifiers
But learning about gcc, I found this option:
-fextended-identifiers
But it seems not working like I wanted, please, somebody know how to accomplish that?
Identifiers can't include such characters. It is defined that way in the language syntax, identifiers are letters, digits or underline (and mustn't begin with a digit to avoid ambiguity with litteral numbers).
If it was possible this would conflict with the C compiler (that uses [ for arrays) and C preprocessor syntax (that uses #). Extended identifiers extension only allow using characters non forbidden by the language syntax inside identifiers (basically unicode foreign letters, etc.).
But if you really, really want to do this, nothings forbids you to preprocess your source files with your own "extended macro preprocessor", practically creating a new "C like" language. That looks like a terrible idea, but it's not really hard to do. Then you'll see soon enough by yourself why it's not a good idea...
According to this link, -fextended-identifiers only enables UTF-8 support for identifiers, so it won't help in your case.
So, answer is: You can't use such characters in macro identifiers.
Even if the extended identifier characters support was fully enabled, it wouldn't help you get characters such as:
[ ' " #
enabled for identifiers. The standard allows 'universal character names' or 'other implementation-defined characters' to be part of an identifier, but they cannot be part of the basic character set. Out of the basic character set, only _, letters and digits can be part of an identifier name (6.4.2.1 Identifiers/General).
Related
I stumbled on some C++ code like this:
int $T$S;
First I thought that it was some sort of PHP code or something wrongly pasted in there but it compiles and runs nicely (on MSVC 2008).
What kind of characters are valid for variables in C++ and are there any other weird characters you can use?
The only legal characters according to the standard are alphanumerics
and the underscore. The standard does require that just about anything
Unicode considers alphabetic is acceptable (but only as single
code-point characters). In practice, implementations offer extensions
(i.e. some do accept a $) and restrictions (most don't accept all of the
required Unicode characters). If you want your code to be portable,
restrict symbols to the 26 unaccented letters, upper or lower case, the
ten digits, and the '_'.
It's an extension of some compilers and not in the C standard
MSVC:
Microsoft Specific
Only the first 2048 characters of Microsoft C++ identifiers are significant. Names for user-defined types are "decorated" by the compiler to preserve type information. The resultant name, including the type information, cannot be longer than 2048 characters. (See Decorated Names for more information.) Factors that can influence the length of a decorated identifier are:
Whether the identifier denotes an object of user-defined type or a type derived from a user-defined type.
Whether the identifier denotes a function or a type derived from a function.
The number of arguments to a function.
The dollar sign is also a valid identifier in Visual C++.
// dollar_sign_identifier.cpp
struct $Y1$ {
void $Test$() {}
};
int main() {
$Y1$ $x$;
$x$.$Test$();
}
https://web.archive.org/web/20100216114436/http://msdn.microsoft.com/en-us/library/565w213d.aspx
Newest version: https://learn.microsoft.com/en-us/cpp/cpp/identifiers-cpp?redirectedfrom=MSDN&view=vs-2019
GCC:
6.42 Dollar Signs in Identifier Names
In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.
http://gcc.gnu.org/onlinedocs/gcc/Dollar-Signs.html#Dollar-Signs
In my knowledge only letters (capital and small), numbers (0 to 9) and _ are valid for variable names according to standard (note: the variable name should not start with a number though).
All other characters should be compiler extensions.
This is not good practice. Generally, you should only use alphanumeric characters and underscores in identifiers ([a-z][A-Z][0-9]_).
Surface Level
Unlike in other languages (bash, perl), C does not use $ to denote the usage of a variable. As such, it is technically valid. In C it most likely falls under C11, 6.4.2. This means that it does seem to be supported by modern compilers.
As for your C++ question, lets test it!
int main(void) {
int $ = 0;
return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, put that into an abstract syntax tree (AST), and then use that to generate code (e.g. assembly/LLVM IR). Your question really only revolves around the first part (e.g. lexing).
The grammar (thus the lexer implementation) of C/C++ does not treat $ as special, unlike commas, periods, skinny arrows, etc... As such, you may get an output from the lexer like this from the below c code:
int i_love_$ = 0;
After the lexer, this becomes a token steam like such:
["int", "i_love_$", "=", "0"]
If you where to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token steam like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0"]
As you can see, because C/C++ doesn't treat characters like $ as special, it is processed differently than other characters like periods.
I am trying to understand when a developer needs to define a C variable with preceding '_'. What is the reason for it?
For example:
uint32_t __xyz_ = 0;
Maybe this helps, from C99, 7.1.3 ("Reserved Identifiers"):
All identifiers that begin with an underscore and either an uppercase letter or another
underscore are always reserved for any use.
All identifiers that begin with an underscore are always reserved for use as identifiers
with file scope in both the ordinary and tag name spaces.
Moral: For ordinary user code, it's probably best not to start identifiers with an underscore.
(On a related note, I think you should also stay clear from naming types with a trailing _t, which is reserved for standard types.)
It is a trick used in the header files of C implementations for global symbols, in order to prevent eventual conflicts with other symbols defined by the user.
Since C lacks a namespace feature, this is a rudimentary approach to avoid name collisions with the user.
Declaring such symbols in your own header and source files is not encouraged because it can introduce naming conflicts between your code and the C implementation. Even if that doesn't produce a conflict on your current implementation, you are still prone to strange conflicts across different/future implementations, since they are free to use other symbols prefixed with underscores.
whether its C or not, the leading underscore provides the programmer a status indication so he does not have to go look it up. In PHP, or any object oriented language where we deal with tens of thousands of properties and methods written by 1000's of authors, seeing an underscore prefix removes the need to go dig through the class andlook up whether its declared private, or protected or public. thats an immense time saver. the practice started before C, i am sure...
I was looking at the runtime.c file in the go runtime at
/usr/local/go/src/pkg/runtime
and saw the following function definitions:
void
runtime∕pprof·runtime_cyclesPerSecond(int64 res)
{...}
and
int64
runtime·tickspersecond(void)
{...}
and there are a lot of declarations like
void runtime·hashinit(void);
in the runtime.h.
I haven't seen this C syntax before (specially the one with the slash seems odd).
Is this part of std C or some plan9 dialect?
It's Go's special internal syntax for Go package paths. For example,
runtime∕pprof·runtime_cyclesPerSecond
is function runtime_cyclesPerSecond in package path runtime∕pprof.
The '∕' character is the Unicode division slash character, which separates path elements. The '·' character is the Unicode middle dot character, which separates the package path and the function.
∕ and · and friends are merely random Unicode characters that someone decided to put in function names. Obscure Unicode characters (edit: that are listed in Annex D of the C99 standard (pages 452-453 of this PDF); see also here) are just as legal in C identifiers as A or 7 (in your average Unicode-capable compiler, anyway).
Char| Hex| Octal|Decimal|Windows Alt-code
----+------+------+-------+----------------
∕ |0x2215|021025| 8725| (null)
· | 0xB7| 0267| 183| Alt+0183
Putting characters that look like operators but aren't (U+2215 ∕, in particular, resembles U+2F / (division) far too closely) in function names can be a confusing practice, so I would personally advise against it. Obviously someone on the Go team decided that whatever reasons they had for including them in function names outweighed the potential for confusion.
(Edit: It should be noted that U+2215 ∕ isn't expressly permitted by Annex D. As discussed here, this may be an extension.)
How can I match a word (1-n characters) in ANSI C? (in addition: What is the pattern to match a constant in C-sourcecode?)
I tried reading the file and passing it to regexec() (regex.h).
Problem: The tool I'm writing should be able to read sourcecode and find
all used constants (#define) to check if they're defined.
The pattern used for testing is: [a-zA-Z_0-9]{1,}. But this would match words such as the "h" in "test.h".
Identifiers must start with a letter or underscore, so the pattern is
[A-Za-z_][A-Za-z0-9_]*
I know of no syntactic difference between C and preprocessor identifiers. There is a convention to use upper case for preprocessor and lowercase for C identifiers, but no actual requirement. Unless defines are guaranteed to use a distinct naming convention you would basically have to find every identifier in the source file and any included files and sort them into preprocessor identifiers, C identifiers and undeclared identifiers.
From the GCC manual:
Preprocessing tokens fall into five broad classes: identifiers, preprocessing numbers, string literals, punctuators, and other. An identifier is the same as an identifier in C: any sequence of letters, digits, or underscores, which begins with a letter or underscore. Keywords of C have no significance to the preprocessor; they are ordinary identifiers. You can define a macro whose name is a keyword, for instance. The only identifier which can be considered a preprocessing keyword is defined.
Another option besides doing regex searches over C source code would be to use a preprocessor library like Boost Wave or perhaps something like Coan instead of starting from scratch.
Here is the Lexer grammar and the Parser grammar (in flex and bison format, respectively) for the entire c language. In particular, the part relevant to identifiers is:
D [0-9]
L [a-zA-Z_]
{L}({L}|{D})* { count(); return(check_type()); }
So the id can start with any uppercase or lowercase letter or an underscore, and then have more uppercase or lowercase letters, underscores, and numbers. I believe it doesn't match parts of file names because they're quoted and it handles quotes separately.
I am doing a simple program that should count the occurrences of ternary operator ?: in C source code. And I am trying to simplify that as much as it is possible. So I've filtered from source code these things:
String literals " "
Character constants ' '
Trigraph sequences ??=, ??(, etc.
Comments
Macros
And now I am only counting the occurances of questionmarks.
So my question question is: Is there any other symbol, operator or anything else what could cause problem - contain '?' ?
Let's suppose that the source is syntax valid.
I think you found all places where a question-mark is introduced and therefore eliminated all possible false-positives (for the ternary op). But maybe you eliminated too much: Maybe you want to count those "?:"'s that get introduced by macros; you dont count those. Is that what you intend? If that's so, you're done.
Run your tool on preprocessed source code (you can get this by running e.g. gcc -E). This will have done all macro expansions (as well as #include substitution), and eliminated all trigraphs and comments, so your job will become much easier.
In K&R ANSI C the only places where a question mark can validly occur are:
String literals " "
Character constants ' '
Comments
Now you might notice macros and trigraph sequences are missing from this list.
I didn't include trigraph sequences since they are a compiler extension and not "valid C". I don't mean you should remove the check from your program, I'm trying to say you already went further then what's needed for ANSI C.
I also didn't include macros because when you're talking about a character that can occur in macros you can mean two things:
Macro names/identifiers
Macro bodies
The ? character can not occur in macro identifiers (http://stackoverflow.com/questions/369495/what-are-the-valid-characters-for-macro-names), and I see macro bodies as regular C code so the first list (string literals, character constants and comments*) should cover them too.
* Can macros validly contain comments? Because if I use this:
#define somemacro 15 // this is a comment
then // this is a comment isn't part of the macro. But what if I would compiler this C file with -D somemacro="15 // this is a comment"?