Matching words in ANSI C

How can I match a word (1-n characters) in ANSI C? (In addition: what is the pattern to match a constant in C source code?)
I tried reading the file and passing it to regexec() (regex.h).
Problem: The tool I'm writing should be able to read source code and find
all used constants (#define) to check whether they're defined.
The pattern used for testing is: [a-zA-Z_0-9]{1,}. But this would match words such as the "h" in "test.h".

Identifiers must start with a letter or underscore, so the pattern is
[A-Za-z_][A-Za-z0-9_]*
I know of no syntactic difference between C and preprocessor identifiers. There is a convention to use upper case for preprocessor and lowercase for C identifiers, but no actual requirement. Unless defines are guaranteed to use a distinct naming convention you would basically have to find every identifier in the source file and any included files and sort them into preprocessor identifiers, C identifiers and undeclared identifiers.
From the GCC manual:
Preprocessing tokens fall into five broad classes: identifiers, preprocessing numbers, string literals, punctuators, and other. An identifier is the same as an identifier in C: any sequence of letters, digits, or underscores, which begins with a letter or underscore. Keywords of C have no significance to the preprocessor; they are ordinary identifiers. You can define a macro whose name is a keyword, for instance. The only identifier which can be considered a preprocessing keyword is defined.
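As a rough sketch of how the identifier pattern above could be used with regexec(), the function below pulls the macro name out of a single #define line. The function name define_name and the surrounding pattern (blanks, #, define) are assumptions made for illustration; it relies on POSIX regex.h with REG_EXTENDED and does not handle comments or line continuations.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copies the macro name from a "#define NAME ..." line into out.
 * Returns 1 on success, 0 if the line is not a #define (or out is too
 * small), and -1 if the pattern fails to compile.
 * define_name is a hypothetical helper, not part of any library. */
int define_name(const char *line, char *out, size_t outsz)
{
    /* Optional blanks, '#', optional blanks, "define", blanks, then the
     * identifier pattern from the answer above as capture group 1. */
    static const char *pattern =
        "^[ \t]*#[ \t]*define[ \t]+([A-Za-z_][A-Za-z0-9_]*)";
    regex_t re;
    regmatch_t m[2];
    int found = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < outsz) {
            memcpy(out, line + m[1].rm_so, len);
            out[len] = '\0';
            found = 1;
        }
    }
    regfree(&re);
    return found;
}
```

Calling it once per input line collects the defined names; the uses of those names would still have to be found by scanning every identifier, as described above.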

Another option besides doing regex searches over C source code would be to use a preprocessor library like Boost Wave or perhaps something like Coan instead of starting from scratch.

Here is the lexer grammar and the parser grammar (in flex and bison format, respectively) for the entire C language. In particular, the part relevant to identifiers is:
D [0-9]
L [a-zA-Z_]
{L}({L}|{D})* { count(); return(check_type()); }
So the id can start with any uppercase or lowercase letter or an underscore, and then have more uppercase or lowercase letters, underscores, and numbers. I believe it doesn't match parts of file names because they're quoted and it handles quotes separately.
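The same rule can be written by hand without flex. The sketch below is only an approximation of the grammar above: next_identifier, is_L and is_D are hypothetical names, quoted literals are skipped so that "test.h" yields nothing, and comments and <...> header names are not handled. It also assumes the default "C" locale, where isalpha() accepts only A-Z and a-z.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_L(int c) { return isalpha((unsigned char)c) || c == '_'; } /* {L} */
static int is_D(int c) { return isdigit((unsigned char)c); }             /* {D} */

/* Advances *p past the next identifier and copies it into out
 * (truncated to fit). Returns 1 if one was found, 0 at end of input. */
int next_identifier(const char **p, char *out, size_t outsz)
{
    const char *s = *p;

    while (*s != '\0') {
        if (*s == '"' || *s == '\'') {
            /* Skip a quoted literal, honouring backslash escapes. */
            char quote = *s++;
            while (*s != '\0' && *s != quote) {
                if (*s == '\\' && s[1] != '\0')
                    s++;
                s++;
            }
            if (*s == quote)
                s++;
        } else if (is_L(*s)) {
            /* {L}({L}|{D})* */
            const char *start = s;
            size_t len;
            while (is_L(*s) || is_D(*s))
                s++;
            len = (size_t)(s - start);
            if (outsz > 0) {
                if (len >= outsz)
                    len = outsz - 1;
                memcpy(out, start, len);
                out[len] = '\0';
            }
            *p = s;
            return 1;
        } else if (is_D(*s)) {
            /* Skip a number together with any suffix, e.g. 10UL or 0x1F. */
            while (is_L(*s) || is_D(*s))
                s++;
        } else {
            s++;
        }
    }
    *p = s;
    return 0;
}
```

Calling next_identifier in a loop until it returns 0 walks through every identifier in a buffer.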

Related

Why Can Variables Begin With $ In C? And Does It Mean Anything? [duplicate]

I stumbled on some C++ code like this:
int $T$S;
First I thought that it was some sort of PHP code or something wrongly pasted in there but it compiles and runs nicely (on MSVC 2008).
What kind of characters are valid for variables in C++ and are there any other weird characters you can use?
The only legal characters according to the standard are alphanumerics
and the underscore. The standard does require that just about anything
Unicode considers alphabetic is acceptable (but only as single
code-point characters). In practice, implementations offer extensions
(i.e. some do accept a $) and restrictions (most don't accept all of the
required Unicode characters). If you want your code to be portable,
restrict symbols to the 26 unaccented letters, upper or lower case, the
ten digits, and the '_'.
It's an extension of some compilers and not in the C standard.
MSVC:
Microsoft Specific
Only the first 2048 characters of Microsoft C++ identifiers are significant. Names for user-defined types are "decorated" by the compiler to preserve type information. The resultant name, including the type information, cannot be longer than 2048 characters. (See Decorated Names for more information.) Factors that can influence the length of a decorated identifier are:
Whether the identifier denotes an object of user-defined type or a type derived from a user-defined type.
Whether the identifier denotes a function or a type derived from a function.
The number of arguments to a function.
The dollar sign is also a valid identifier in Visual C++.
// dollar_sign_identifier.cpp
struct $Y1$ {
    void $Test$() {}
};
int main() {
    $Y1$ $x$;
    $x$.$Test$();
}
https://web.archive.org/web/20100216114436/http://msdn.microsoft.com/en-us/library/565w213d.aspx
Newest version: https://learn.microsoft.com/en-us/cpp/cpp/identifiers-cpp?redirectedfrom=MSDN&view=vs-2019
GCC:
6.42 Dollar Signs in Identifier Names
In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.
http://gcc.gnu.org/onlinedocs/gcc/Dollar-Signs.html#Dollar-Signs
To my knowledge, only letters (uppercase and lowercase), digits (0 to 9) and _ are valid in variable names according to the standard (note: a variable name must not start with a digit).
All other characters are compiler extensions.
This is not good practice. Generally, you should only use alphanumeric characters and underscores in identifiers ([A-Za-z0-9_]).
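A minimal sketch of that portability rule as a check: is_portable_identifier is a hypothetical helper that accepts only the 26 unaccented letters in either case, the ten digits and the underscore, and rejects a leading digit. It uses explicit character tables rather than isalpha() so the result does not depend on the locale, and it does not check for keywords.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if s is non-empty, contains only A-Z, a-z, 0-9 and '_',
 * and does not start with a digit; returns 0 otherwise. */
int is_portable_identifier(const char *s)
{
    static const char letters[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_";
    static const char digits[] = "0123456789";
    size_t i;

    /* The empty-string test must come first: strchr() would otherwise
     * match the terminating '\0' of the table. */
    if (s == NULL || s[0] == '\0' || strchr(letters, s[0]) == NULL)
        return 0;

    for (i = 1; s[i] != '\0'; i++) {
        if (strchr(letters, s[i]) == NULL && strchr(digits, s[i]) == NULL)
            return 0;
    }
    return 1;
}
```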
Surface Level
Unlike some other languages (bash, Perl), C does not use $ to mark the use of a variable, so nothing in the grammar conflicts with it. Strictly, $ is not one of the standard identifier characters; it most likely falls under C11 6.4.2.1, which lets an implementation accept additional implementation-defined characters in identifiers. In practice, modern compilers do seem to support it.
As for your C++ question, let's test it!
int main(void) {
    int $ = 0;
    return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, put that into an abstract syntax tree (AST), and then use that to generate code (e.g. assembly/LLVM IR). Your question really only revolves around the first part (e.g. lexing).
The grammar of C/C++ (and thus the lexer implementation) does not give $ a meaning of its own, unlike commas, periods, arrows (->), and so on. With a compiler that accepts $ in identifiers, take the C code below:
int i_love_$ = 0;
After the lexer, this becomes a token stream like this:
["int", "i_love_$", "=", "0", ";"]
If you were to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token stream like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0", ";"]
As you can see, because C/C++ doesn't treat characters like $ as special, it is processed differently than other characters like periods.
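To make the token stream above concrete, here is a toy lexer, far simpler than a real compiler's. The names lex and TOKEN_MAX are made up for this sketch, and accepting $ in identifiers mirrors the compiler extension discussed here rather than the standard. It knows only identifiers, decimal integers and single-character punctuators, and it emits the trailing ; as a token of its own.

```c
#include <ctype.h>
#include <stddef.h>

#define TOKEN_MAX 32   /* hypothetical fixed size of one token buffer */

/* '$' is accepted as an identifier character, as an extension. */
static int ident_start(int c)
{
    return isalpha((unsigned char)c) || c == '_' || c == '$';
}

static int ident_part(int c)
{
    return ident_start(c) || isdigit((unsigned char)c);
}

/* Splits src into tokens and returns how many were stored. */
size_t lex(const char *src, char tokens[][TOKEN_MAX], size_t max_tokens)
{
    size_t count = 0;

    while (*src != '\0' && count < max_tokens) {
        size_t n = 0;

        if (isspace((unsigned char)*src)) {
            src++;
            continue;
        }
        if (ident_start(*src)) {
            while (ident_part(*src)) {          /* identifier */
                if (n + 1 < TOKEN_MAX)
                    tokens[count][n++] = *src;
                src++;
            }
        } else if (isdigit((unsigned char)*src)) {
            while (isdigit((unsigned char)*src)) {   /* integer */
                if (n + 1 < TOKEN_MAX)
                    tokens[count][n++] = *src;
                src++;
            }
        } else {
            tokens[count][n++] = *src++;        /* one-char punctuator */
        }
        tokens[count][n] = '\0';
        count++;
    }
    return count;
}
```

Because , and . fall into the punctuator branch while $ falls into the identifier branch, i_love_$ stays one token and _and_.s is split into three.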

What are the rules about using an underscore in a C identifier?

It's common in C (and other languages) to use prefixes and suffixes for names of variables and functions. Particularly, one occasionally sees the use of underscores, before or after a "proper" identifier, e.g. _x and _y variables, or _print etc. But then, there's also the common wisdom of avoiding names starting with underscore, so as to not clash with the C standard library implementation.
So, where and when is it ok to use underscores?
Good-enough rule of thumb
Don't start your identifier with an underscore.
That's it. You might still have a conflict with some file-specific definitions (see below), but those will just get you an error message which you can take care of.
Safe, slightly restrictive, rule of thumb
Don't start your identifier with:
An underscore.
Any 1-3 letter prefix, followed by an underscore, which isn't a proper word (e.g. a_, st_)
memory_ or atomic_.
and don't end your identifier with either _MIN or _MAX.
These rules forbid a bit more than what is actually reserved, but are relatively easy to remember.
More detailed rules
This is based on the C2x standard draft (and thus covers previous standards' reservations) and the glibc documentation.
Don't use:
The prefix __ (two underscores).
A prefix of one underscore followed by a capital letter (e.g. _D).
For identifiers visible at file scope - the prefix _.
The following prefixes with underscores, when followed by a lowercase letter: atomic_, memory_, memory_order_, cnd_, mtx_, thrd_, tss_
The following prefixes with underscores, when followed by an uppercase letter: LC_, SIG_, ATOMIC_, TIME_
The suffix _t (that's a POSIX restriction; for C proper, you can use this suffix unless your identifier begins with int or uint)
Additional restrictions are per-library-header-file rather than universal (some of these are POSIX restrictions):
If you use header file...   You can't use identifiers with...
dirent.h                    Prefix d_
fcntl.h                     Prefixes l_, F_, O_, and S_
grp.h                       Prefix gr_
limits.h                    Suffix _MAX (also probably _MIN)
pwd.h                       Prefix pw_
signal.h                    Prefixes sa_ and SA_
sys/stat.h                  Prefixes st_ and S_
sys/times.h                 Prefix tms_
termios.h                   Prefix c_
And there are additional restrictions not involving underscores of course.
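The first three detailed rules (the leading-underscore ones) are simple enough to sketch as a check. The helper is_reserved_underscore_name and the enum scope type are hypothetical names for this illustration, and the function deliberately ignores the prefix and suffix lists and the per-header table.

```c
#include <ctype.h>

enum scope { BLOCK_SCOPE, FILE_SCOPE };   /* hypothetical helper type */

/* Returns 1 if name is covered by one of the leading-underscore rules:
 *   - prefix "__"
 *   - prefix '_' followed by an uppercase letter
 *   - any prefix '_' when the identifier is visible at file scope
 * Returns 0 otherwise. */
int is_reserved_underscore_name(const char *name, enum scope where)
{
    if (name[0] != '_')
        return 0;
    if (name[1] == '_' || isupper((unsigned char)name[1]))
        return 1;                   /* reserved regardless of scope */
    return where == FILE_SCOPE;     /* reserved only at file scope */
}
```

For example, _x is reported as reserved for a global variable but not for a local one, while __x and _X are reported as reserved in both cases.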
The C standard, library chapter, reserves certain identifiers (emphasis mine):
C17 7.1.3 Reserved identifiers
— All identifiers that begin with an underscore and either an uppercase letter or another
underscore are always reserved for any use.
— All identifiers that begin with an underscore are always reserved for use as identifiers
with file scope in both the ordinary and tag name spaces.
— Each macro name in any of the following subclauses (including the future library
directions) is reserved for use as specified if any of its associated headers is included;
unless explicitly stated otherwise (see 7.1.4).
— All identifiers with external linkage in any of the following subclauses (including the
future library directions) and errno are always reserved for use as identifiers with
external linkage.
— Each identifier with file scope listed in any of the following subclauses (including the
future library directions) is reserved for use as a macro name and as an identifier with
file scope in the same name space if any of its associated headers is included.
Where "reserved for any use" means reserved for the compiler/standard library, see What's the meaning of "reserved for any use"? "Reserved for the implementation" also means reserved for the compiler/standard library.
Furthermore, Future library directions (C17 7.31) reserves a lot of identifiers - it's a big chapter, so I'll only quote the most notable parts:
7.31.10 Integer types <stdint.h>
Typedef names beginning with int or uint and ending with _t may be added to the
types defined in the <stdint.h> header. Macro names beginning with INT or UINT
and ending with _MAX, _MIN, or _C may be added to the macros defined in the
<stdint.h> header.
7.31.12 General utilities <stdlib.h>
Function names that begin with str and a lowercase letter may be added to the
declarations in the <stdlib.h> header.
7.31.13 String handling <string.h>
Function names that begin with str, mem, or wcs and a lowercase letter may be added to the declarations in the <string.h> header.
To answer your question directly:
So, where and when is it ok to use underscores?
Strictly speaking: nowhere as a leading character. You should never declare identifiers starting with an underscore, since they may clash with the standard library, language keywords, etc. Though, as the second item quoted above implies, a single underscore followed by a lowercase letter is allowed for identifiers that are not at file scope, such as local variables.

Is there a significance to a leading underscore in the argument name of a function-like macro?

Some preprocessor macros I come across have arguments with names containing a leading underscore; for example, in the Linux kernel:
#define DEVICE_ATTR(_name, _mode, _show, _store) \
struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
These arguments appear to behave just like regular macro arguments, so I can't figure out why the author decided to have a leading underscore for each argument name. Is there some significance to the concatenation with _name, or are the underscores just a convention the author chose to use?
No, there is no special significance: these are regular identifiers. My guess as to why the authors decided to add underscores like that is to make the composition of these attributes more legible:
dev_attr_##_name
is easier to read than
dev_attr##name
The __ATTR, however, looks suspicious: in C, identifiers that start with an underscore followed by an uppercase letter or another underscore are reserved for the implementation. In this case, it's two underscores, so I would expect __ATTR to be a system macro.
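A small user-space sketch shows that the underscores on the parameters change nothing. MAKE_ATTR and MAKE_ATTR_PLAIN are hypothetical stand-ins for DEVICE_ATTR, not kernel macros; both paste their first argument onto attr_ in exactly the same way.

```c
/* Parameters with leading underscores, in the style of DEVICE_ATTR. */
#define MAKE_ATTR(_name, _value) \
    static const int attr_##_name = (_value)

/* The same macro with plain parameter names. */
#define MAKE_ATTR_PLAIN(name, value) \
    static const int attr_##name = (value)

MAKE_ATTR(mode, 0644);        /* static const int attr_mode = (0644); */
MAKE_ATTR_PLAIN(size, 128);   /* static const int attr_size = (128);  */

/* Uses both generated variables so the expansion is easy to verify. */
int attr_sum(void)
{
    return attr_mode + attr_size;
}
```

The parameter name is replaced by the argument before ## pastes the tokens, so attr_##_name with argument mode yields attr_mode, not attr__mode.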

C universal macro names - gcc -fextended-identifiers

I'm looking for a way to write identifier names containing characters like [ ' " or #.
Every time I try to do that, I get the error:
error: macro names must be identifiers
But while learning about gcc, I found this option:
-fextended-identifiers
It doesn't seem to work the way I wanted, though. Does anybody know how to accomplish that?
Identifiers can't include such characters. It is defined that way in the language syntax: identifiers consist of letters, digits and underscores (and mustn't begin with a digit, to avoid ambiguity with numeric literals).
If it were possible, this would conflict with the C compiler (which uses [ for arrays) and the C preprocessor syntax (which uses #). The extended identifiers extension only allows characters that the language syntax does not forbid inside identifiers (basically Unicode letters from other scripts, etc.).
But if you really, really want to do this, nothing forbids you from preprocessing your source files with your own "extended macro preprocessor", practically creating a new "C-like" language. That looks like a terrible idea, but it's not really hard to do. Then you'll see soon enough for yourself why it's not a good idea...
According to this link, -fextended-identifiers only enables UTF-8 support for identifiers, so it won't help in your case.
So, answer is: You can't use such characters in macro identifiers.
Even if the extended identifier characters support was fully enabled, it wouldn't help you get characters such as:
[ ' " #
enabled for identifiers. The standard allows 'universal character names' or 'other implementation-defined characters' to be part of an identifier, but they cannot be part of the basic character set. Out of the basic character set, only _, letters and digits can be part of an identifier name (6.4.2.1 Identifiers/General).

What are the valid characters for macro names?

Are C-style macro names subject to the same naming rules as identifiers? After a compiler upgrade, it is now emitting this warning for a legacy application:
warning #3649-D: white space is required between the macro name "CHAR_" and its replacement text
#define CHAR_& 38
This line of code is defining an ASCII value constant for an ampersand.
#define DOL_SN 36
#define PERCENT 37
#define CHAR_& 38
#define RT_SING 39
#define LF_PAR 40
I assume that this definition (not actually referenced by any code, as far as I can tell) is buggy and should be changed to something like "CHAR_AMPERSAND"?
Macro names should only consist of alphanumeric characters and underscores, i.e. 'a-z', 'A-Z', '0-9', and '_', and the first character should not be a digit. Some preprocessors also permit the dollar sign character '$', but you shouldn't use it; unfortunately I can't quote the C standard since I don't have a copy of it.
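With the line as written, the preprocessor ends the macro name at the first character that cannot be part of an identifier, so it sees a macro called CHAR_ whose replacement text is "& 38", hence the warning. A possible fix, assuming the rename suggested in the question (CHAR_AMPERSAND is only a suggested name) and an ASCII execution character set:

```c
/* Original line:  #define CHAR_& 38
 * Parsed as:      name "CHAR_", replacement text "& 38". */
#define DOL_SN         36
#define PERCENT        37
#define CHAR_AMPERSAND 38   /* renamed; the new name is an assumption */
#define RT_SING        39
#define LF_PAR         40

/* Returns 1 if each constant equals the character it is meant to name.
 * This holds on ASCII-based systems. */
int constants_match_ascii(void)
{
    return DOL_SN == '$' && PERCENT == '%' && CHAR_AMPERSAND == '&'
        && RT_SING == '\'' && LF_PAR == '(';
}
```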
From the GCC documentation:
Preprocessing tokens fall into five broad classes: identifiers, preprocessing numbers, string literals, punctuators, and other. An identifier is the same as an identifier in C: any sequence of letters, digits, or underscores, which begins with a letter or underscore. Keywords of C have no significance to the preprocessor; they are ordinary identifiers. You can define a macro whose name is a keyword, for instance. The only identifier which can be considered a preprocessing keyword is defined. See Defined.
This is mostly true of other languages which use the C preprocessor. However, a few of the keywords of C++ are significant even in the preprocessor. See C++ Named Operators.
In the 1999 C standard, identifiers may contain letters which are not part of the “basic source character set”, at the implementation's discretion (such as accented Latin letters, Greek letters, or Chinese ideograms). This may be done with an extended character set, or the '\u' and '\U' escape sequences. The implementation of this feature in GCC is experimental; such characters are only accepted in the '\u' and '\U' forms and only if -fextended-identifiers is used.
As an extension, GCC treats '$' as a letter. This is for compatibility with some systems, such as VMS, where '$' is commonly used in system-defined function and object names. '$' is not a letter in strictly conforming mode, or if you specify the -$ option. See Invocation.
clang allows a lot of "crazy" characters, although I have struggled to find any rhyme or reason as to why some are allowed and others are not. For example:
#define 💩 ?: /// WORKS FINE
#define ■ #end /// WORKS FINE
#define 🅺 #interface /// WORKS FINE
#define P #protocol /// WORKS FINE
yet
#define ☎ TEL /// ERROR: Macro name must be an identifier.
#define ❌ NO /// ERROR: Macro name must be an identifier.
#define ⇧ UP /// ERROR: Macro name must be an identifier.
#define 〓 == /// ERROR: Macro name must be an identifier.
#define 🍎 APPLE /// ERROR: Macro name must be an identifier.
Who knows. I'd love to... but Google has thus failed me, so far. Any insight on the subject, would be appreciated™️.
You're right, the same rules apply to macros and identifiers as far as the names are concerned: valid characters are [A-Za-z0-9_].
It's common practice to use CAPITALIZED names to differentiate macros from other identifiers such as variable and function names.
The same rules that specify valid identifiers for variable names apply to macro names, with the exception that macros may have the same names as keywords. Identifier names are made up of digits and non-digits and must not start with a digit. Non-digits include the uppercase letters A-Z, the lowercase letters a-z, the underscore, and any implementation-defined characters.
