Lex: multiple files, different rules - c

I have to parse more than one file, with different rules for each case. That is: I need some rules to work when processing a file, and be disabled afterwards.
I could simply use a global variable to track the state of the program, and have my rules decide inside their body whether to do anything useful or not, like this:
%{
static int state;
%}
%%
{something} {
if (state == SOMETHING_STATE) ...
}
{something_else} {
if (state == SOMETHING_ELSE_STATE) ...
}
%%
I'm guessing there's a better way to do this, though. Is there?

You want start states. Lex has a built-in state variable, and you can annotate rules so that they fire only in certain states:
%s state1
%s state2
%s state3
%%
<state1>{something} { /* this rule matches only in state1 */
/* change to state2 */
BEGIN(state2); }
<state1,state3>{something_else} { /* this rule matches in state1 or state3 */ }
{more_stuff} { /* this rule matches in all states */ }
You can find documentation on this in any lex or flex book, or in the online flex documentation
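For intuition, the effect of start states can be modelled in plain C. This is only a rough sketch of the idea and not what lex or flex actually generates: each rule carries the set of states it is active in, and firing a rule may switch the state, much like BEGIN does. The states, keywords, token values and the function scan_word are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical start states and token codes (not from the question). */
enum { OUTSIDE, INSIDE };
enum { TOK_OPEN = 1, TOK_CLOSE, TOK_ITEM, TOK_NOTE };

#define KEEP_STATE (-1)

/* One rule: a literal word, the states it is active in, the state to
 * switch to when it fires (like BEGIN), and the token it returns. */
struct rule {
    const char *word;
    unsigned    active_mask;   /* bit i set => rule is active in state i */
    int         next_state;    /* KEEP_STATE => stay in the current state */
    int         token;
};

static const struct rule rules[] = {
    { "open",  1u << OUTSIDE,                   INSIDE,     TOK_OPEN  },
    { "close", 1u << INSIDE,                    OUTSIDE,    TOK_CLOSE },
    { "item",  1u << INSIDE,                    KEEP_STATE, TOK_ITEM  },
    { "note",  (1u << OUTSIDE) | (1u << INSIDE), KEEP_STATE, TOK_NOTE  },
};

/* Returns the token for `word` in the current state, or 0 if no rule
 * that is active in this state matches. Updates *state when a rule
 * asks for a state change. */
int scan_word(const char *word, int *state)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        const struct rule *r = &rules[i];
        if (!(r->active_mask & (1u << *state)))
            continue;                       /* disabled in this state */
        if (strcmp(r->word, word) != 0)
            continue;
        if (r->next_state != KEEP_STATE)
            *state = r->next_state;
        return r->token;
    }
    return 0;
}
```

Starting in OUTSIDE, scan_word("item", &state) finds no active rule; after scan_word("open", &state) the same word is recognised, because the state is now INSIDE.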

Assuming you're using something like Flex rather than the original lex, you can use start states. When you parse something that changes how you do the parsing, you enter a state to do that parsing, and you tag the appropriate rules with that state. When you parse whatever marks the end of that state, you switch to another:
{state_signal} { BEGIN(STATE2); }
<STATE2>{something} { handle_something(); }
<STATE2>{state3_signal} { BEGIN(STATE3); }
<STATE3>{something} {handle_something2(); }
<STATE3>{whatever} { BEGIN(INITIAL); }

Using Flex and Yacc, you can also build different individual parsers and tweak the makefile to compile them into a single executable. I believe you're only using Flex, but this should be possible too.
I can't say exactly how to do it right now, but I did it quite some time ago, so if you look around you'll find a way. Basically, you'll have to compile each parser with a different prefix (-P), which generates differently named parse functions and differently named global variables for the parsers to use.

Related

Flex: REJECT rejects one character at a time?

I'm parsing C++-style scoped names, e.g., A::B. I want to parse such a name as a single token for a type name if it was previously declared as a type; otherwise, I want to parse it as three tokens A, ::, and B.
Given:
L [A-Za-z_]
D [0-9]
S [ \f\r\t\v]
identifier {L}({L}|{D})*
sname {identifier}({S}*::{S}*{identifier})+
%%
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
REJECT;
}
{identifier} {
// ...
return Y_NAME;
}
However, for the input sequence:
A::BB -> Not a type; returned as "A", "::", "BB"
(Declare A::B as a type.)
A::BB
On the second parse of A::BB, REJECT is called, and flex discards only one character from the end and tries to match A::B (one B). This matches the previously declared A::B in the {sname} rule, which is wrong.
What I assumed REJECT did was to proceed to the second-best matching rule with the same input. Hence, I expected it to match A alone in {identifier} and just leave ::BB on the input stream. But, as shown, that's not what happens. It peels off one character at a time from the end of the input and re-attempts a match.
Adding in yyless() to chop off the ::BB doesn't help:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
REJECT;
}
The only thing that I've found that works is:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
goto identifier;
}
{identifier} {
identifier:
// ...
return Y_NAME;
}
But this seems kind of hack-ish. Is there a more canonical "flex way" to do what I want?
Despite the hack-ish-ness, is there actually anything wrong with my solution? I.e., would it not work in some cases? Or not work in some versions of Flex?
Yes, I'm aware that, even if it did work the way I want, this won't parse all contrived C++ scoped names; but it's good enough.
Yes, I'm aware that generally parsing such things should be done as separate tokens, but C-like languages are hard to parse since they're not context-free. Knowing when you have a type really helps.
Yes, I'm aware that REJECT slows down the lexer. This is (mostly) for an interactive and command-line tool in a terminal, so the human is the slowest component.
I'd like to focus on the problem at hand with the code as it mostly is. Mine is more of a question about how to use REJECT, yyless(), etc., to get the desired behavior. Thanks.
REJECT does not make any distinction between different rules; it just falls back to the next possible accepting pattern (which might not even be shorter, if there's a lower-precedence rule which matches the same token.) That might be a shorter match of the same pattern. (Normally, Flex chooses the longest match out of the possible matches of the regular expression. With REJECT, the shorter matches are also considered.)
So you can avoid the false match of A::B for input A::BB by using trailing context: [Note 1]
{sname}/[^[:alnum:]_] {...}
(In Flex, you can't put trailing context in a macro because the expansion is surrounded with parentheses.)
You could use that solution if you wanted to try all possible complete prefixes of id1::id2::id3::id4 starting with the longest one (id1::id2::id3::id4) and falling back to each shorter prefix in turn (id1::id2::id3, id1::id2, id1). The trailing context prevents intermediate matches in the middle of an identifier.
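To make the fallback order concrete, here is a small C model of which {sname} matches REJECT would step through, with and without the trailing context. It is only a sketch of the behaviour described above and not how flex is implemented; the helper names (match_scoped, reject_candidates) are made up, and the blank characters are the ones from the S definition in the question.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_id_start(char c) { return isalpha((unsigned char)c) || c == '_'; }
static int is_id_char(char c)  { return isalnum((unsigned char)c) || c == '_'; }
static int is_blank(char c)
{
    return c == ' ' || c == '\f' || c == '\r' || c == '\t' || c == '\v';
}

/* Does s[0..len) match  identifier ( S* "::" S* identifier )*  exactly?
 * On success, *parts is the number of identifiers in it. */
static int match_scoped(const char *s, size_t len, int *parts)
{
    size_t i = 0;
    *parts = 0;
    for (;;) {
        if (i >= len || !is_id_start(s[i]))
            return 0;
        while (i < len && is_id_char(s[i]))
            i++;
        ++*parts;
        if (i == len)
            return 1;
        while (i < len && is_blank(s[i]))
            i++;
        if (i + 1 >= len || s[i] != ':' || s[i + 1] != ':')
            return 0;
        i += 2;
        while (i < len && is_blank(s[i]))
            i++;
    }
}

/* Writes to out[], longest first, every prefix length of `input` that the
 * {sname} pattern (two or more identifiers) accepts. If trailing_context
 * is nonzero, a prefix only counts when it is followed by a character
 * that cannot continue an identifier, as in {sname}/[^[:alnum:]_].
 * Returns the number of lengths written (at most max). */
size_t reject_candidates(const char *input, int trailing_context,
                         size_t *out, size_t max)
{
    size_t n = 0;
    for (size_t len = strlen(input); len > 0 && n < max; len--) {
        int parts;
        if (!match_scoped(input, len, &parts) || parts < 2)
            continue;
        if (trailing_context &&
            (input[len] == '\0' || is_id_char(input[len])))
            continue;
        out[n++] = len;
    }
    return n;
}
```

For the input "A::BB;" the model yields lengths 5 and 4 without trailing context (A::BB, then the unwanted A::B), and only length 5 with it.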
I'm not sure if that is the problem you are trying to solve, though. And even if it is, it doesn't really solve the problem of what to do after you fall back to an acceptable solution, because you probably don't want to redo the search for prefixes in the middle of the sequence. In other words, if the original were A::B::A::B, and A::B is a "known prefix", the tokens returned should (presumably) be A::B, ::, A, ::, B rather than A::B, ::, A::B.
On the other hand, you might not care about previously defined prefixes; only whether the complete sequence is a known type. In that case, the possible token sequences for A::B::C::D are [TYPENAME(A::B::C::D)] and [ID(A), ::, ID(B), ::, ID(C), ::, ID(D)]. For this case, you don't want to restrict the {sname} fallbacks; you want to eliminate them. After the fallback, you only want to try matches for different rules.
There are a variety of alternative solutions (which do not use REJECT), mostly relying on the possibility of using start conditions. (You still need to think about what happens after the fallback. Start conditions can be useful for that, as well.)
One fairly general solution is to define a start condition which excludes the rule you want to fall back from. Once you've determined that you want to fall back to a different rule, you change to a start condition which excludes the rule which just matched and then call yyless(0) (without returning). That will cause Flex to rescan the current token from the beginning in the new start condition. (If you have several possible rules which might match, you will need several start conditions. This could get out of hand, but in most cases the possible set of matching rules is very limited.)
At some point, you need to return to the previous start condition. (You could use a start condition stack if it's not trivial to figure out which was the previous start condition.) Ideally, to avoid false interior matches, you would want to return to the previous start condition only when you reached the end of the first (incorrect) match. That's easy to do if you are tracking the input position for each token; you just need to save the position just before calling yyless(0). (You also need to correct the token input position by subtracting yyleng before yyless sets yyleng to 0).
Rescanning from the beginning of the token might seem inefficient, but it's less inefficient than the overhead imposed by REJECT. And the overhead of REJECT affects the entire scanner operation, while the rescanning solution is essentially free except for the tokens you happen to rescan. [Note 2]
(BEGIN(some_state); yyless(0); is a reasonably common flex idiom; this is not the only use case. Another one is the answer to the question "How do I run a start condition until I reach a token I can't identify without consuming that token?")
But I think that in this case there is a simpler solution, using yymore to accumulate the token. (This avoids having to do your own dynamically expanded token buffer.) Again, there are two possibilities, depending on whether you might allow initial prefix matches or restrict the possibilities to either the full sequence or the first identifier.
But the outline is the same: Match the shortest valid possibility first, remember how long the token was at that point, and then use a start condition to enable a pattern which extends the token, either to the next possibility or to the end of the sequence, depending on your needs. Before continuing with the scanner loop, you call yymore() which indicates to Flex that the next pattern extends the token rather than replacing it.
At each possible match end, you test to see if that would really be a valid token and if so, record the position (and whatever else you might need to recall, such as the token type). When you reach a point where you can no longer extend the match, you use yyless to fall back to the last valid match point, and return the token.
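The outline in the last two paragraphs can be sketched in plain C, outside of flex, for the variant that allows known prefixes. This is a simplified illustration under stated assumptions: whitespace around "::" is not handled, find_type_n is a made-up stand-in for the question's find_type() that knows only the type A::B, and scan_scoped is a hypothetical name.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for find_type(): only "A::B" is a known type. */
static int find_type_n(const char *s, size_t len)
{
    return len == 4 && strncmp(s, "A::B", 4) == 0;
}

/* Length of the identifier at the start of s, or 0 if there is none. */
static size_t ident_len(const char *s)
{
    size_t n = 0;
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}

/* Returns the length of the token to emit at the start of s: the longest
 * scoped-name prefix that is a known type, or else the first identifier
 * alone. *is_type tells which of the two it was. */
size_t scan_scoped(const char *s, int *is_type)
{
    size_t first = ident_len(s);  /* shortest valid possibility */
    size_t end = first;           /* how far the token has been extended */
    size_t last_valid = 0;        /* end of the last prefix that was a type */

    *is_type = 0;
    if (first == 0)
        return 0;
    for (;;) {
        if (find_type_n(s, end))
            last_valid = end;     /* remember this match point */
        /* Try to extend by "::" identifier (the yymore() step). */
        if (s[end] != ':' || s[end + 1] != ':')
            break;
        size_t more = ident_len(s + end + 2);
        if (more == 0)
            break;
        end += 2 + more;
    }
    if (last_valid > 0) {
        *is_type = 1;
        return last_valid;        /* fall back to the last valid point (yyless) */
    }
    return first;                 /* no type found: first identifier only */
}
```

Because candidates are only tested at identifier boundaries, "A::BB" comes back as the single identifier A (length 1), while "A::B::A::B" comes back as the type A::B (length 4).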
This is slightly less inefficient than the pure yyless() solution because it avoids the rescan of the token which is returned. (All these solutions, including REJECT, do rescan the text following the selected match if it is shorter than the longest possible extended match. That's theoretically avoidable but since it's not a lot of overhead, it doesn't seem worthwhile to build a complex mechanism to avoid it.)
Again, you probably want to avoid trying to extend token matches after the fallback until you reach the longest extent. This can be solved the same way, by recording the longest matched extent, but the start condition handling is a bit simpler.
Here's some not-very-well tested code for the simpler problem, where only the first identifier and the full match are possible:
%{
/* Code to track current token position and the fallback position.
* It's a simple byte count; line-ends are not dealt with.
*/
static int token_pos = 0;
static int token_max_pos = 0;
/* This is done before every action, even empty actions */
#define YY_USER_ACTION token_pos += yyleng;
/* FALLBACK needs to undo YY_USER_ACTION */
#define FALLBACK(to) do { token_pos -= yyleng - to; yyless(to); } while (0)
/* SET_MORE needs to pre-undo the next YY_USER_ACTION */
#define SET_MORE(X) do { token_pos -= yyleng; yymore(); } while(0)
%}
%x EXTEND_ID
ident [[:alpha:]_][[:alnum:]_]*
%%
    int fallback_leng = 0;
    /* The trailing context here is to avoid triggering EOF in
     * the EXTEND_ID state.
     */
{ident}/[^[:alnum:]_]   {
                          /* In the fallback region, don't attempt to extend the match. */
                          if (token_pos <= token_max_pos)
                            return Y_IDENT;
                          fallback_leng = yyleng;
                          BEGIN(EXTEND_ID);
                          SET_MORE(yymore);
                        }
{ident}                 { return find_type(yytext) ? Y_TYPE_NAME : Y_IDENT; }
<EXTEND_ID>{
  ([[:space:]]*"::"[[:space:]]*{ident})*|.|\n  {
                          BEGIN(INITIAL);
                          if (yyleng == fallback_leng + 1)
                            FALLBACK(fallback_leng);
                          if (find_type(yytext))
                            return Y_TYPE_NAME;
                          else {
                            /* Record how far the rejected match reached. */
                            token_max_pos = token_pos;
                            FALLBACK(fallback_leng);
                            return Y_IDENT;
                          }
                        }
}
Notes
1. At least, I think you can do that. I haven't ever tried and REJECT does impose a number of limitations on other scanner features. Really, it's a historical artefact which massively slows down lexical analysis and should generally be avoided.
2. The use of REJECT anywhere in the rules causes Flex to switch to a different template in which a list of endpoints is retained, which has a cost in both time and space. Another consequence is that Flex can no longer resize the input buffer.

Solr: how to add extra field values not in the document

I know Lucene and have just started learning Solr. In the simple example, documents are added using the example's post.jar (../update -jar post.jar). Without writing my own Java code to add documents, and using that same approach (post.jar), is there a way to add fields that are not in the document? For example, say my schema includes name, age, and id fields, but the document has no 'id' field and I want the id and its value to be included. I know which id and value I want, but how do I include them?
Thanks in advance!
I don't believe you can mix the two. You can use post.jar to add documents using arguments passed on the command line, a file, stdin, or a simple crawl of a web page, but there is no way to combine them. In the source code for post.jar you can see it's a series of else-if statements, so the modes are mutually exclusive.
-Ddata args, stdin, files, web
Use args to pass arguments along the command line (such as a command
to delete a document). Use files to pass a filename or regex pattern
indicating paths and filenames. Use stdin to use standard input. Use
web for a very simple web crawler (arguments for this would be the URL
to crawl).
https://cwiki.apache.org/confluence/display/solr/Simple+Post+Tool
/**
 * After initialization, call execute to start the post job.
 * This method delegates to the correct mode method.
 */
public void execute() {
  final long startTime = System.currentTimeMillis();
  if (DATA_MODE_FILES.equals(mode) && args.length > 0) {
    doFilesMode();
  } else if(DATA_MODE_ARGS.equals(mode) && args.length > 0) {
    doArgsMode();
  } else if(DATA_MODE_WEB.equals(mode) && args.length > 0) {
    doWebMode();
  } else if(DATA_MODE_STDIN.equals(mode)) {
    doStdinMode();
  } else {
    usageShort();
    return;
  }
  if (commit) commit();
  if (optimize) optimize();
  final long endTime = System.currentTimeMillis();
  displayTiming(endTime - startTime);
}
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/util/SimplePostTool.java
You could try to modify the code but I think a better bet would be to either pre-process your xml files to include the missing fields, or learn to use the API (either via Java or hitting it with Curl) to do this on your own.

How do I turn off a static code analysis warning on a line-by-line basis in CDT (C code)?

We have a project using CDT in Eclipse. It's an old project that we just imported into Eclipse, and I want to ensure we start using static code analysis to find any weirdnesses.
The thing is, there are a bunch of lines that trigger warnings that we want to just ignore, with the main ones being fallthroughs within switch statements.
I know how to do this for lint, but what about for CDT? Is there a single-line comment that I can put right above the line?
Example: ("No break at the end of case")
case enChA:
nChannel++;
// I want to ignore this fallthrough
case enChB:
nChannel++;
// And this one...
case enChC:
nChannel++;
// And this one...
case enChD:
nChannel++;
// do some more stuff...
break;
You should try
//no break
before the next case.
These settings are located under Window -> Preferences -> C/C++ -> Code Analysis, where you can customize them. For example, if you pick No break at the end of case, you can define the comment that suppresses the warning. By default it's "no break", so copy/pasting the warning message into the comment coincidentally worked in your case.
The text doesn't have to be an exact match, and it doesn't seem to be case sensitive either.
Referring to your follow-up question about unused variables: when you customize Unused variable in file scope, you can define variable names that should be ignored.
There are two cryptic predefined exceptions "#(#)" and "$Id". Unfortunately I couldn't find any official documentation so I went looking into the source. It looks like the checker simply tests if a variable name contains() any of the specified exceptions. If it does, the warning is suppressed.
Outside of Eclipse CDT, there's the popular void-casting trick. If the compiler warns about an unused variable, cast it to void. This operation has no effect, so it's safe, but from the perspective of the compiler that variable is now used. I usually wrap it in a macro to make abundantly clear what I'm doing, e.g.
#define UNUSED(var) (void)(var)
void foobar()
{
int b; // not used.
UNUSED(b); // now it's used
}
Solved it.
I just added the text from the warning that I wanted to ignore to immediately above where the break would be.
Like this:
case enChC:
++nChannel;
//No break at the end of case
case enChD:
++nChannel;
As has been said, in this specific case it can be solved by adding the comment:
//no break
or:
//no break at the end of case
What really matters is the "no break" part.
It is also required that there are no other comments between the end of this case and the next one, or it won't work. For example, the following will still result in a warning:
case enChC:
++nChannel;
//No break
//This second case decrease the value
case enChD:
++nChannel;
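Putting the answers so far together, here is a small compilable sketch in which the explanation is folded into the suppression comment itself, so that no second comment sits between the comment and the next case. The enum values, the function channels_from and the exact comment wording are only illustrative. It assumes the default "no break" setting described above, and that the checker accepts extra text in the same comment, as the answers report.

```c
/* Hypothetical channel ids, mirroring the names used in the question. */
enum channel { enChA, enChB, enChC, enChD };

/* Counts the channels at or after `first`, relying on deliberate
 * fallthrough from each case into the next one. */
int channels_from(enum channel first)
{
    int nChannel = 0;

    switch (first) {
    case enChA:
        ++nChannel;
        /* fall through - no break (enChA continues into enChB) */
    case enChB:
        ++nChannel;
        /* fall through - no break (enChB continues into enChC) */
    case enChC:
        ++nChannel;
        /* fall through - no break (enChC continues into enChD) */
    case enChD:
        ++nChannel;
        break;
    }
    return nChannel;
}
```

Calling channels_from(enChC) counts enChC and enChD only, which makes the intended fallthrough easy to check.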
You have to upgrade to Eclipse Oxygen.3 (or .2).
Beginning with these versions, warnings/markers can be suppressed by simply using "Quick Fix".
UPDATE 2021
Version: 2021-03 (4.19.0)
Build id: 20210312-0638
Assuming you have this switched on:
Window -> Preferences -> (Your code example) C/C++ -> Code Analysis -> No break at end of case
If you go to Customize Selected..., you will find "Comment text to suppress the problem [...]", which defines the comment text that is recognized. This only changes the effect; it still won't fix the problem.
The relevant line is in the file org.eclipse.cdt and looks like this:
org.eclipse.cdt.codan.internal.checkers.CaseBreakProblem.params={launchModes=>{RUN_ON_FULL_BUILD=>true,RUN_ON_INC_BUILD=>true,RUN_ON_FILE_OPEN=>false,RUN_ON_FILE_SAVE=>false,RUN_AS_YOU_TYPE=>true,RUN_ON_DEMAND=>true},suppression_comment=>"#suppress(\"No break at end of case\")",no_break_comment=>"no break",last_case_param=>false,empty_case_param=>false,enable_fallthrough_quickfix_param=>false}
With this setting, as in my example, the code below works.
Note that the rest of the comment's content does not matter, but it must contain a fragment identical to the one configured in Eclipse itself.
Additionally, it is important to put the comment after the } character, unless it is a single argument case.
As long as the check is enabled in the settings, Eclipse will still report the problem, and this cannot be avoided, but it will no longer underline every line in the case.
switch(next) {
case 0: {
if(true) {
//Do stuff...
} else {
break;
}
next = 1;
}
// ThiS Is The Line that CAUSE N"no break////////"
case 1: {
if(true) {
//Do stuff...
} else {
break;
}
next = 2;
}
case 2: {
if(true) {
//Wont do stuff...
//It will break here
break;
} else {
next = 3;
}
}
}
I present a photo that shows the effect in the eclipse itself.
I ran into this problem as well and just wanted to eliminate the warnings.
I added /* no break */; make sure it is placed before the next "case".
This is the problem I encountered:
This is the solution I use:

Emacs indentation for multi-level nesting of C code

I'm completely new to emacs (mostly used vim and eclipse/netbeans etc.) I was playing with multi-level nesting of C code and wrote some sample code in emacs to test how it indents code where nesting is way too deep (not real life code though).
int foo()
{
if (something) {
if (anotherthing) {
if (something_else) {
if (oh_yes) {
if (ah_now_i_got_it) {
printf("Yes!!!\n");
}
}
}
}
}
}
This looked exactly like this as I typed in emacs and saved it. But opening it on a different text editor showed the actual text saved is this:
int foo()
{
if (something) {
if (anotherthing) {
if (something_else) {
if (oh_yes) {
if (ah_now_i_got_it) {
printf("Yes!!!\n");
}
}
}
}
}
}
So I was wondering is there any way in emacs to save the text the way it actually displays?
My current c-default-style is set to "linux".
EDIT:
Ok, I was using Notepad++/Vim to view the file saved by emacs and it showed that "wrong" indentation, but looks like, opening with good old notepad (or even doing a cat file.c) shows the correct indentation as displayed by emacs. Will try the other approaches mentioned here. Thanks!
Try using spaces instead of tabs for indentation. Add the following to your init.el:
(setq-default indent-tabs-mode nil)
This will make all buffers use spaces by default. You will want to add the following exception for makefiles:
(add-hook 'makefile-mode-hook (lambda () (setq indent-tabs-mode t)))
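Why the same file can look different in two editors: if the saved indentation mixes tabs and spaces (an assumption here, though it is what the advice above suggests), the displayed column depends on each editor's tab width. A minimal C sketch, with the hypothetical helper indent_column, computes where the first non-blank character lands:

```c
#include <stddef.h>

/* Column at which the first non-blank character of `line` is displayed
 * in an editor that places tab stops every `tab_width` columns. */
size_t indent_column(const char *line, size_t tab_width)
{
    size_t col = 0;

    for (; *line == ' ' || *line == '\t'; line++) {
        if (*line == '\t')
            col += tab_width - col % tab_width;  /* jump to the next tab stop */
        else
            col++;
    }
    return col;
}
```

With a tab width of 8, a line indented by one tab plus four spaces lands at column 12 and a line indented by two tabs at column 16, so the nesting is visible. With a tab width of 4, both land at column 8 and the two levels appear to have the same depth. Indenting with spaces only gives the same column for every tab width.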

Alternative to putting an include inside a While loop

I have a chunk of PHP code that I'd like to include on a number of different pages but be able to update in one location (hence my use of an include file). However, the chunk of code needs to appear inside a while loop -- specifically inside a while loop that is echoing out MySQL rows.
However, there are roughly 200 rows in the MySQL query I'm echoing, so having an include in the loop really slows things down. I've tried making what's in the include file a function, as shown below, then including it once at the top of the page and calling the function inside the loop, but it doesn't seem to work (I just don't get any data in the variables I'm setting, etc.)
How does one put a chunk of code inside a loop without using include?
Thanks very much.
function CYCalc()
{
// If the company's current fiscal quarter
// is equal to the current calendar quarter,
// use the company's fiscal years as calendar years
if ($UniverseResult[CurQ] == "Q1" && $UniverseResult[CurYear] == "2012") {
$C2011Sales = number_format($UniverseResult[SalesYear2]/1000000,1);
$C2012Sales = number_format($UniverseResult[SalesYear3]/1000000,1);
$C2011EPS = $UniverseResult[EPSYear2];
$C2012EPS = $UniverseResult[EPSYear3];
}
}
Remember PHP's scoping rules. Variables defined in the global scope are not visible inside functions unless you explicitly declare them as global within the function:
<?php
$x = 7;
function y() {
echo $x; // undefined
}
function z() {
global $x;
echo $x; // 7
}
function a($x) {
echo $x; // 7
}
For your CYCalc() to work, you'd need to declare $UniverseResult global as per z() above, or pass it in as a parameter as per a() above.

Resources