I am trying to write a regex to detect IP addresses and floating point numbers in re2c (http://re2c.org/). Here are the regexes I am using:
<SYMBOL> [-+]?[0-9]+[.][0-9]+ { RETURN(FLOAT); }
<SYMBOL> [0-9]{1,3}'.'[0-9]{1,3}'.'[0-9]{1,3}'.'[0-9]{1,3} { RETURN(IPADDR); }
Whenever I compile, it throws an error about YYMARKER being undeclared, but if I use only one of the rules the compilation goes fine. I guess re2c is having trouble with backtracking-based regexes, since both rules match a large set of inputs with a common prefix (for example, 192.132 could be the start of both a floating point number and an IP address).
Here is the command line I am using to generate the tokenizer file; re2c itself does not throw any error:
re2c -c -o tokenizer.c tokenizer.re
But when I compile the C file I get the following error:
tokenizer.c: In function 'getnext_querytoken':
tokenizer.c:74: error: 'YYMARKER' undeclared (first use in this function)
tokenizer.c:74: error: (Each undeclared identifier is reported only once
tokenizer.c:74: error: for each function it appears in.)
Is there any way I can solve this problem?
@sushil, you are right: YYMARKER is part of the re2c API.
However, re2c is not "having trouble with backtracking based regex since both the rules have a large data set". re2c-generated lexers iterate the input only once (complexity is linear). YYMARKER is needed because the rules overlap, as explained in this example: http://re2c.org/examples/example_01.html:
YYMARKER (line 5) is needed because the rules overlap: it backs up the input position of the longest successful match. Say we have overlapping rules "a" and "abc" and the input string "abd": by the time "a" matches, there is still a chance to match "abc", but when the lexer sees 'd' it must roll back. (You might wonder why YYMARKER is exposed at all: why not make it a local variable like yych? The reason is that all input pointers must be updated by YYFILL, as explained in the "Arbitrary large input and YYFILL" example.)
Looks like I did not read the manpage properly. According to the manpage, I needed to manually define the variable YYMARKER to support backtracking in re2c. Here is the extract from http://re2c.org/manual.html:
YYMARKER: l-value of type *YYCTYPE. The generated code saves backtracking information in YYMARKER. Some easy scanners might not use this.
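For reference, here is a minimal sketch of the fix (the token names, the NUL-sentinel handling, and dropping the <SYMBOL> condition are my assumptions; the .re file goes through plain re2c, without -c, and then the C compiler):

enum { FLOAT, IPADDR, END, ERROR };   /* hypothetical token codes */

int getnext_querytoken(const char *YYCURSOR)
{
    const char *YYMARKER;   /* we declare it; the generated code assigns it */
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        [-+]?[0-9]+[.][0-9]+                              { return FLOAT; }
        [0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3} { return IPADDR; }
        [\x00]                                            { return END; }
        *                                                 { return ERROR; }
    */
}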
Compilation generally occurs in several stages: lexical analysis, syntax analysis, etc. Say, in C, I wrote
a=24;
without declaring a as an int. Now, at what stage of compilation is the error detected? At the syntax analysis stage? If that is the case, then what does the lexical analyzer do? Just tokenize the source code?
Speaking of compilers in general, the error will occur at the syntax analysis phase, when the parser looks the symbol up in the symbol table entries; the subsequent phases see it only if compilation proceeds after recovering from the error.
The dragon book also clearly states this; it is mentioned on the page where the types of errors are listed. The topic to study thoroughly to understand this issue is section 4.1.3, Syntax Error Handling.
a = 24; // without declaring a as an int type variable.
Here, the job of the lexical phase is simply to read characters, form tokens, and pass them on to the later phases, i.e., to the parser in the syntax analysis phase, and so on.
I don't know your compiler, but in general this would be in the parsing stage (syntax analysis) and not the lexical stage (tokenizing). Most C compilers will be written using a lex/yacc variant, which makes the above assumption more plausible. If you want to know the details, dive into the dragon book, a great resource.
If I were to write the compiler, I'd have the lexical analyzer spit out tokens (in this case: a, =, 24 and finally ;). The parser would maintain a symbol table and upon seeing the symbol a it would check whether the symbol was in the table; if not (as in your example) it would signal an error.
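To make that split concrete, here is a toy sketch (all names are hypothetical, and a real symbol table is far richer): the lexer's job ends at producing the tokens, and it is the later, table-consulting phase that raises the error.

#include <stdio.h>
#include <string.h>

/* pretend symbol table: only these identifiers have been declared */
static const char *symtab[] = { "b", "count" };

static int is_declared(const char *name)
{
    for (unsigned i = 0; i < sizeof symtab / sizeof symtab[0]; i++)
        if (strcmp(symtab[i], name) == 0)
            return 1;
    return 0;
}

int main(void)
{
    /* the lexer happily emits: IDENT(a) ASSIGN NUMBER(24) SEMI */
    printf("tokens: IDENT(a) ASSIGN NUMBER(24) SEMI\n");

    /* a later phase consults the table and complains */
    if (!is_declared("a"))
        fprintf(stderr, "error: 'a' undeclared\n");
    return 0;
}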
I'm new to C and looking at Go's source tree I found this:
https://code.google.com/p/go/source/browse/src/pkg/runtime/race.c
void runtime∕race·Read(int32 goid, void *addr, void *pc);
void runtime∕race·Write(int32 goid, void *addr, void *pc);
void
runtime·raceinit(void)
{
// ...
}
What do the slashes and dots (·) mean? Is this valid C?
IMPORTANT UPDATE:
The ultimate answer is certainly the one you got from Russ Cox, one of the Go authors, on the golang-nuts mailing list. That said, I'm leaving some of my earlier notes below; they might help to understand some things.
Also, from reading the answer linked above, I believe the ∕ "pseudo-slash" may now be translated to a regular / slash too (like the middot is translated to a dot) in versions of the Go C compiler newer than the one I tested below - but I don't have time to verify.
The file is compiled by the Go Language Suite's internal C compiler, which originates in the Plan 9 C compiler (1)(2) and has some differences (mostly extensions, AFAIK) from the C standard.
One of the extensions is that it allows UTF-8 characters in identifiers.
Now, in the Go Language Suite's C compiler, the middot character (·) is treated in a special way: it is translated to a regular dot (.) in object files, and the dot is interpreted by the Go Language Suite's internal linker as the namespace separator character.
Example
For the following file example.c (note: it must be saved as UTF-8 without BOM):
void ·Bar1() {}
void foo·bar2() {}
void foo∕baz·bar3() {}
the internal C compiler produces the following symbols:
$ go tool 8c example.c
$ go tool nm example.8
T "".Bar1
T foo.bar2
T foo∕baz.bar3
Now, please note I've given ·Bar1() a capital B. This is because that way I can make it visible to regular Go code: it is translated to the exact same symbol as would result from compiling the following Go code:
package example
func Bar1() {} // nm will show: T "".Bar1
Now, regarding the functions you named in the question, the story goes further down the rabbit hole. I'm a bit less sure I'm right here, but I'll try to explain based on what I know. Thus, each sentence below this point should be read as if it had "AFAIK" written at the end.
So, the next missing piece needed to better understand this puzzle is to know something more about the strange "" namespace and how the Go suite's linker handles it. The "" namespace is what we might call an "empty" namespace (because "" to a programmer means "an empty string") or, maybe better, a "placeholder" namespace. When the linker sees an import like this:
import examp "path/to/package/example"
//...
func main() {
examp.Bar1()
}
then it takes the $GOPATH/pkg/.../example.a library file and, during the import phase, substitutes each "" with path/to/package/example on the fly. So now, in the linked program, we will see a symbol like this:
T path/to/package/example.Bar1
The "·" character is \xB7 according to my Javascript console.
The "∕" character is \x2215.
The dot falls within Annex D of the C99 standard, which lists the special characters that are valid in identifiers in C source. The slash doesn't seem to, so I suspect it's used for something else (perhaps namespacing) via a #define or preprocessor magic.
That would explain why the dot is present in the actual function definition, but the slash is not.
Edit: Check this answer for some additional information. It's possible that the Unicode slash is just allowed by GCC's implementation.
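If you want to check those code points yourself, here is a small sketch (assuming the source file itself is saved as UTF-8) that dumps the UTF-8 bytes of the two characters:

#include <stdio.h>

int main(void)
{
    const unsigned char middot[] = "·";   /* U+00B7, encodes as C2 B7 */
    const unsigned char slash[]  = "∕";   /* U+2215, encodes as E2 88 95 */
    for (const unsigned char *p = middot; *p; p++) printf("%02X ", *p);
    printf("\n");
    for (const unsigned char *p = slash; *p; p++) printf("%02X ", *p);
    printf("\n");
    return 0;
}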
It appears this is not standard C, nor C99. In particular, both gcc and clang complain about the dot, even when in C99 mode.
This source code is compiled by the Plan 9 compiler suite (in particular, ./pkg/tool/darwin_amd64/6c on OS X), which is bootstrapped by the Go build system. According to this document, bottom of page 8, Plan 9 and its compiler do not use ASCII at all, but use Unicode instead. At the bottom of page 9, it is stated that any character with a sufficiently high code point is considered valid for use in an identifier name.
There's no pre-processing magic at all - the definitions of the functions do not match the declarations simply because those are different functions. For example, void runtime∕race·Initialize(); is an external function whose definition appears in ./src/pkg/runtime/race/race.go; likewise for void runtime∕race·MapShadow(…).
The function which appears later, void runtime·raceinit(void), is a completely different function, which is apparent from the fact that it actually calls runtime∕race·Initialize();.
The Go compiler/runtime is compiled using the C compilers originally developed for Plan 9. When you build Go from source, it'll first build the Plan 9 compilers, then use those to build Go.
The Plan 9 compilers support Unicode function names [1], and the Go developers use Unicode characters in their function names as pseudo-namespaces.
[1] It looks like this might actually be standards compliant (see "g++ unicode variable name"), but gcc doesn't support Unicode function/variable names.
I am fairly new to regexes, so I wrote the following simple regex using a positive lookahead that detects functions and function calls in a C source file:
\w+(?=\s*\()
It works fine, but the problem is that it also detects non-function syntax like if(), while(), etc.
I can easily avoid this by saying:
(if(?!\()) | (while(?!\())
But the problem is how to combine the second regex with the first one. I can't OR them, because the first one still matches if(), while(), etc., and in an OR expression it's enough for one of the terms to match.
How do I combine these regexes, or write a better, simpler one that will not match non-function syntax like if() and while()?
PS: I use the following tools to test my regexes:
GSkinner
RegexPal
There are quite a lot of assumptions involved when you search for function calls in C with a regex. That aside, if you are happy with what is matched (there are valid function calls that will not be matched) and you want to exclude if and while from the result list, you can use the following regex:
(?!\b(if|while|for)\b)\b\w+(?=\s*\()
The regex uses the word boundary \b to make sure that the whole name is matched (preventing a partial match of hile in while) and that the whole name is not a keyword (preventing rejection of whilenothinghappens).
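If you want to try it outside the online testers, here is a sketch in C using PCRE2 (my choice, not implied by the question - any engine with lookahead support will do; build with -lpcre2-8):

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    PCRE2_SPTR pattern = (PCRE2_SPTR)"(?!\\b(if|while|for)\\b)\\b\\w+(?=\\s*\\()";
    PCRE2_SPTR subject = (PCRE2_SPTR)"if (ready) start_engine(rpm);";
    int errcode;
    PCRE2_SIZE erroffset;

    pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
                                   &errcode, &erroffset, NULL);
    if (re == NULL)
        return 1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    PCRE2_SIZE offset = 0, len = strlen((const char *)subject);

    /* print every match: "if" is skipped, "start_engine" is reported */
    while (pcre2_match(re, subject, len, offset, 0, md, NULL) > 0) {
        PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
        printf("match: %.*s\n", (int)(ov[1] - ov[0]),
               (const char *)subject + ov[0]);
        offset = ov[1];
    }

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}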
How do I customize the default action for flex? I found something like <*>, but when I run it, it says "flex scanner jammed". Also, the . rule only adds a rule, so it does not work either. What I want is:
comment "/*"[^"*/"]*"*/"
%%
{comment} return 1;
{default} return 0;
<<EOF>> return -1;
Is it possible to change the behavior of matching longest to match first? If so I would do something like this
default (.|\n)*
but because this almost always gives a longer match it will hide the comment rule.
EDIT
I found the {-} operator in the manual; however, this example straight from the manual gives me "unrecognized rule":
[a-c]{-}[b-z]
The flex default rule matches a single character and prints it on standard output. If you don't want that action, write an explicit rule which matches a single character and does something else.
The pattern (.|\n)* matches the entire input file as a single token, so that is a very bad idea. You're thinking that the default should be a long match, but in fact you want it to be as short as possible (but not empty).
The purpose of the default rule is to do something when there is no match for any of the tokens in the input language. When lex is used for tokenizing a language, such a situation is almost always erroneous because it means that the input begins with a character which is not the start of any valid token of the language.
Thus, a "catch any character" rule is coded as a form of error recovery. The idea is to discard the bad character (just one) and try tokenizing from the character after that one. This is only a guess, but it's a good guess because it's based on what is known: namely that there is one bad character in the input.
The recovery rule can be wrong. For instance suppose that no token of the language begins with #, and the programmer wanted to write the string literal "#abc". Only, she forgot the opening " and wrote #abc". The right fix is to insert the missing ", not to discard the #. But that would require a much more clever set of rules in the lexer.
Anyway, usually when discarding a bad character, you want to issue an error message for this case, like "skipping invalid character '~' in line 42, column 3".
The default rule/action of copying the unmatched character to standard output is useful when lex is used for text filtering. The default rule then brings about the semantics of a regex search (as opposed to a regex match): the idea is to search the input for matches of the lexer's token-recognizing state machine, while printing all material that is skipped by that search.
So for instance, a lex specification containing just the rule:
"foo" { printf("bar"); }
will implement the equivalent of
sed -e 's/foo/bar/g'
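For concreteness, here is a complete sketch of that filter (the %option noyywrap line is my addition so no support library is needed); saved as foobar.l and built with flex foobar.l && cc lex.yy.c -o foobar, it behaves like the sed command above on stdin:

%option noyywrap
%%
"foo"   { printf("bar"); }
%%
/* everything that does not match "foo" is echoed by the default rule */
int main(void) { yylex(); return 0; }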
I solved the problem manually instead of trying to match the complement of a rule. This works fine because the matching pattern involved in this case is quite simple.
Why does adding "." not do the trick? You can't perform an action in the absence of a match: flex won't do anything if there is no match, so to add a "default" rule, just make it match something.
<*>.|\n /* default action here */
Using this at the end of the file catches the default rule across all start conditions. It's useful for finding out where there may be holes.
What I don't know (and would like to know) is how to get flex to report where the default rule match has been found.
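A partial answer to that last point, as a sketch: flex's standard %option yylineno lets the catch-all at least report the line of the stray character (column tracking would need extra bookkeeping):

%option noyywrap yylineno
%%
[a-zA-Z_][a-zA-Z0-9_]*   { printf("IDENT(%s)\n", yytext); }
[ \t\n]+                 { /* skip whitespace */ }
<*>.|\n                  { fprintf(stderr, "line %d: unexpected '%c'\n",
                                   yylineno, yytext[0]); }
%%
int main(void) { yylex(); return 0; }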
I am looking for a strange macro definition - strange on purpose: I need a macro defined in such a way that if the macro is ever used in compiled code, the compiler will unfailingly produce an error.
The background: since C11 introduced several new keywords, and the C++11 standard also added a few, I would like to introduce a header file in my projects (mostly using C89/C95 compilers with a few additions) to force developers to refrain from using these new keywords as identifier names - unless, of course, they are recognized as keywords in the intended fashion.
In the ancient past, I did this for new like this:
#define new *** /* C++ keyword, do not use */
And yes, it worked. Until it didn't, when a programmer forgot the underscore in a parameter name:
void myfunction(uint16_t new parameter);
I have used variants since, but I've never been challenged again.
Now I intend to create a file with all the keywords not supported by various compilers, and I'm looking for a dependable solution, at best with a not-too-confusing error message. "Syntax error" would be OK, but "parameter missing" would already be confusing. I'm thinking along the lines of
#define atomic +*=*+ /* C11 derived keyword; do not use */
and aside from my usual hesitation, I'm quite sure that any use (but not the definition) of the macro will produce an error.
EDIT: To make it even more difficult, MISRA only allows the use of the basic source and execution character set, so # or $ are not allowed.
But I'd like to ask the community: Do you have a better macro value? As effective, but shorter? Or even longer but more dependable in some strange situation? Or a completely different method to generate an error (only using the compiler, please, not external tools!) when a "discouraged" identifier is used for any purpose?
Disclaimer:
And, yes, I know I can use grep or a parser in a nightly build and report the warnings it finds. But dropping an immediate error on the developer's desk is quicker, and certain to be fixed before check-in.
If the sport is for the shortest token sequence that always produces an error, any combination of two one-character operators that can't legally occur together will do, but:
don't use ({ or }) because gcc has a special meaning for that
don't use any sort of unbalanced parentheses because they can lead you far away until the error is recognized
don't use < or > because they could match template parameters for C++
don't use prefix operators as second character
don't use postfix operators as first character
This leaves some possibilities:
.., .|, and other combinations with ., since . expects a following identifier
&|, &/, &^, &,, &;
!|, !/, !^, !,, !;
But actually, to be more user-friendly, I'd also place a _Pragma in it first, so the compiler would also emit a warning:
#define atomic _Pragma("message \"some instructive text that you should read\"") ..
I think you can just use an illegal symbol:
#define bad_name #
Another one that would work is this:
static const char *const illegal_keyword = "";
#define bad_name (illegal_keyword = "bad_name")
(Note the second const: without a const pointer, the assignment would be perfectly legal.)
It will give you an error about changing a constant. Also, the error message will usually be quite good:
Line 8: error: called object 'illegal_keyword = "printf"' is not a function
And the final one, which is perhaps the shortest and will always work, is this:
#define bad_name #
Because the preprocessor will never replace twice, and # is illegal outside of the preprocessor, this will always error.
#define atomic do not use atomic
The expansion is not recursive, so it stops. The only way to stop it from being a compilation error would be:
#define do
#define not
#define use
but that's verboten because do and not are keywords.
The error message might even include 'atomic'. You can increase the probability of that by rephrasing the message:
#define atomic atomic cannot be used
(Now you are not playing with keywords in the message, though.)
I think [[]] isn't a valid sequence of tokens anywhere, so you could use that:
#define keyword [[]]
The error will be a syntax error, complaining about [ or ].
My attempt:
#define new new[-1]
#define atomic atomic[-1]
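A quick sketch of how this trap fires (hypothetical usage): the expansion happens exactly once, so the offending name even survives into the error message.

#define atomic atomic[-1]

int main(void)
{
    int atomic = 0;   /* expands to: int atomic[-1] = 0; -> negative array size */
    return 0;
}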