I am trying to implement a very simple parser using flex. I am currently stuck in the ID recognition. That is my code:
ID [a−zA−Z_][a−zA−Z0−9_]*
...
{ID} { printf( "An identifier: %s\n", yytext ); return TOK_ID;}
However what I get is only the first letter of the identifier, for example if I try to parse:
int _underscore ;
The result is:
An identifier: _
Any advice?
EDIT:
With a more accurate analysis I have figured out that the code is able to recognize only the id with a,z,A,Z,_, that are the explicit characters in the regular expression. I did not find anything like that online, is that a bug?
EDIT2:
If I modify the code in that way all work
ID [a−zA−Z_][a−zA−Z0−9_]*
...
[a−zA−Z_][a−zA−Z0−9_]* { printf( "An identifier: %s\n", yytext ); return TOK_ID;}
According to the documentation it should work also in the other way.
This is a character encoding issue. In your copy-and-pasted source code, the things that look like ASCII hyphens (-, code U+2D) in your definition of ID:
ID [a−zA−Z_][a−zA−Z0−9_]*
aren't. Instead they're unicode minus signs (−, U+2212). If you replace the incorrect minus signs with the correct hyphens, the line will look like:
ID [a-zA-Z_][a-zA-Z0-9_]*
Depending on your font, if you look very closely, you may see a difference between the − in the first version and the - in the second.
Anyway, replace your ID definition with the second version above (or else retype it from scratch, and all should be well.
Related
I'm parsing C++-style scoped names, e.g., A::B. I want to parse such a name as a single token for a type name if it was previously declared as a type; otherwise, I want to parse it as three tokens A, ::, and B.
Given:
L [A-Za-z_]
D [0-9]
S [ \f\r\t\v]
identifier {L}({L}|{D})*
sname {identifier}({S}*::{S}*{identifier})+
%%
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
REJECT;
}
{identifier} {
// ...
return Y_NAME;
}
However, for the input sequence:
A::BB -> Not a type; returned as "A", "::", "BB"
(Declare A::B as a type.)
A::BB
What happens on the second parse of A::BB, REJECT is called and flex discards only 1 character from the end and tries to match A::B (one B). This matches the previously declared A::B in the {sname} rule which is wrong.
What I assumed REJECT did was to proceed to the second-best matching rule with the same input. Hence, I expected it to match A alone in {identifier} and just leave ::BB on the input stream. But, as shown, that's not what happens. It peels off one character at a time from the end of the input and re-attempts a match.
Adding in yyless() to chop off the ::BB doesn't help:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
REJECT;
}
The only thing that I've found that works is:
{sname} {
if ( find_type( yytext ) )
return Y_TYPE_NAME;
char const *const sep = strpbrk( yytext, ": \t" );
size_t keep_len = (size_t)(sep - yytext);
yyless( keep_len );
goto identifier;
}
{identifier} {
identifier:
// ...
return Y_NAME;
}
But these seems kind of hack-ish. Is there a more canonical "flex way" to do what I want?
Despite the hack-ish-ness, is there actually anything wrong with my solution? I.e., would it not work in some cases? Or not work in some versions of Flex?
Yes, I'm aware that, even if it did work the way I want, this won't parse all contrived C++ scoped names; but it's good enough.
Yes, I'm aware that generally parsing such things should be done as separate tokens, but C-like languages are hard to parse since they're not context-free. Knowing when you have a type really helps.
Yes, I'm aware that REJECT slows down the lexer. This is (mostly) for an interactive and command-line tool in a terminal, so the human is the slowest component.
I'd like to focus on the problem at hand with the code as it mostly is. Mine is more of a question about how to use REJECT, yyless(), etc., to get the desired behavior. Thanks.
REJECT does not make any distinction between different rules; it just falls back to the next possible accepting pattern (which might not even be shorter, if there's a lower-precedence rule which matches the same token.) That might be a shorter match of the same pattern. (Normally, Flex chooses the longest match out of the possible matches of the regular expression. With REJECT, the shorter matches are also considered.)
So you can avoid the false match of A::B for input A::BB by using trailing context: [Note 1]
{sname}/[^[:alnum:]_] {...}
(In Flex, you can't put trailing context in a macro because the expansion is surrounded with parentheses.)
You could use that solution if you wanted to try all possible complete prefixes of id1::id2::id3::id4 starting with the longest one (id1::id2::id3::id4) and falling back to each shorter prefix in turn (id1::id2::id3, id1::id2, id1). The trailing context prevents intermediate matches in the middle of an identifier.
I'm not sure if that is the problem you are trying to solve, though. And even if it is, it doesn't really solve the problem of what to do after you fall back to an acceptable solution, because you probably don't want to redo the search for prefixes in the middle of the sequence. In other words, if the original were A::B::A::B, and A::B is a "known prefix", the tokens returned should (presumably) be A::B, ::, A, ::, B rather than A::B, ::, A::B.
On the other hand, you might not care about previously defined prefixes; only whether the complete sequence is a known type. In that case, the possible token sequences for A::B::C::D are [TYPENAME(A::B::C::D)] and [ID(A),, ::, ID(B), ::, ID(C), ID(D)]. For this case, you don't want to restrict the {sname} fallbacks; you want to eliminate them. After the fallback, you only want to try matches for different rules.
There are a variety of alternative solutions (which do not use REJECT), mostly relying on the possibility of using start conditions. (You still need to think about what happens after the fallback. Start conditions can be useful for that, as well.)
One fairly general solution is to define a start condition which excludes the rule you want to fall back from. Once you've determined that you want to fall back to a different rule, you change to a start condition which excludes the rule which just matched and then call yyless(0) (without returning). That will cause Flex to rescan the current token from the beginning in the new start condition. (If you have several possible rules which might match, you will need several start conditions. This could get out of hand, but in most cases the possible set of matching rules is very limited.)
At some point, you need to return to the previous start condition. (You could use a start condition stack if it's not trivial to figure out which was the previous start condition.) Ideally, to avoid false interior matches, you would want to return to the previous start condition only when you reached the end of the first (incorrect) match. That's easy to do if you are tracking the input position for each token; you just need to save the position just before calling yyless(0). (You also need to correct the token input position by subtracting yyleng before yyless sets yyleng to 0).
Rescanning from the beginning of the token might seem inefficient, but it's less inefficient than the overhead imposed by REJECT. And the overhead of REJECT affects the entire scanner operation, while the rescanning solution is essentially free except for the tokens you happen to rescan. [Note 2]
(BEGIN(some_state); yyless(0);) is a reasonably common flex idiom; this is not the only use case. Another one is the answer to the question "How do I run a start condition until I reach a token I can't identify without consuming that token?")
But I think that in this case there is a simpler solution, using yymore to accumulate the token. (This avoids having to do your own dynamically expanded token buffer.) Again, there are two possibilities, depending on whether you might allow initial prefix matches or restrict the possibilities to either the full sequence or the first identifier.
But the outline is the same: Match the shortest valid possibility first, remember how long the token was at that point, and then use a start condition to enable a pattern which extends the token, either to the next possibility or to the end of the sequence, depending on your needs. Before continuing with the scanner loop, you call yymore() which indicates to Flex that the next pattern extends the token rather than replacing it.
At each possible match end, you test to see if that would really be a valid token and if so, record the position (and whatever else you might need to recall, such as the token type). When you reach a point where you can no longer extend the match, you use yyless to fall back to the last valid match point, and return the token.
This is slightly less inefficient than the pure yyless() solution because it avoids the rescan of the token which is returned. (All these solutions, including REJECT, do rescan the text following the selected match if it is shorter than the longest possible extended match. That's theoretically avoidable but since it's not a lot of overhead, it doesn't seem worthwhile to build a complex mechanism to avoid it.)
Again, you probably want to avoid trying to extend token matches after the fallback until you reach the longest extent. This can be solved the same way, by recording the longest matched extent, but the start condition handling is a bit simpler.
Here's some not-very-well tested code for the simpler problem, where only the first identifier and the full match are possible:
%{
/* Code to track current token position and the fallback position.
* It's a simple byte count; line-ends are not dealt with.
*/
static int token_pos = 0;
static int token_max_pos = 0;
/* This is done before every action, even empty actions */
#define YY_USER_ACTION token_pos += yyleng;
/* FALLBACK needs to undo YY_USER_ACTION */
#define FALLBACK(to) do { token_pos -= yyleng - to; yyless(to); } while (0)
/* SET_MORE needs to pre-undo the next YY_USER_ACTION */
#define SET_MORE(X) do { token_pos -= yyleng; yymore(); } while(0)
%}
%x EXTEND_ID
ident [[:alpha:]_][[:alnum:]_]*
%%
int fallback_leng = 0;
/* The trailing context here is to avoid triggering EOF in
* the EXTEND_ID state.
*/
{ident}/[^[:alnum:]_] {
/* In the fallback region, don't attempt to extend the match. */
if (token_pos <= token_max_pos)
return Y_IDENT;
fallback_leng = yyleng;
BEGIN(EXTEND_ID);
SET_MORE(yymore);
}
{ident} { return find_type(yytext) ? Y_TYPE_NAME : Y_IDENT; }
<EXTEND_ID>{
([[:space:]]*"::"[[:space:]]*{ident})*|.|\n {
BEGIN(INITIAL);
if (yyleng == fallback_leng + 1)
FALLBACK(fallback_leng);
if (find_type(yytext))
return Y_TYPE_NAME;
else {
FALLBACK(fallback_leng);
return Y_IDENT;
}
}
}
Notes
At least, I think you can do that. I haven't ever tried and REJECT does impose a number of limitations on other scanner features. Really, it's a historical artefact which massively slows down lexical analysis and should generally be avoided.
The use of REJECT anywhere in the rules causes Flex to switch to a different template in which a list of endpoints is retained, which has a cost in both time and space. Another consequence is that Flex can no longer resize the input buffer.
૮₍ • ᴥ • ₎ა・Raiden ▬▭⋱𓂅
ᘏ⑅ᘏ╭╯Welcome╰╮𓂃ᘏᗢ
・・・・・・・・・・・・・・・・・・・・・
https://discord.gg/rsCC8y7WC4
・・・・・・・・・・・・・・・・・・・・・
Join!
・・・・・・・・・・・・・・・・・・・・・
How can I pull the "discord.gg/rsCC8y7WC4" link from a text like this
console.log(invitelink) ==> discord.gg/rsCC8y7WC4
Use String.match() for this. String.match accepts a regex argument which looks like this:
let str = 'hey there this is just a random string';
let res = str.match(/random/);
//res is now ['random']
Now for your problem, you are probably looking for this:
if(msg.content.match(/discord\.gg\/.+/) || msg.content.match(/discordapp\.com\/invite\/.+/)) return msg.channel.send('Hey! You put an invite in your message!');
Now that regex may look a bit messy/complicated but the \s are to escape the character and make sure regex knows that it’s not the special character it uses, and is actually just a character part of the search. To clarify why the above example should work, here’s a little explanation:
match() returns either an array (if it gets a match) or null (if it gets no match). You are searching for the strings 'discord.gg/' followed by any characters and also checking for the string 'discordapp.com/invite/' also followed by any characters.
If this doesn’t work, please tell me.
I have done some searching but cannot see how to actually code this. I am new to Python and not really sure what method I should use to try to do this.
I have some files that I would like to rename. Unfortunately the portion towards the file extension is never the same and would like to just remove it.
File name is like AC_DC - Shot Down In Flames (Official Video)-UKwVvSleM6w.mp3
Any help would be appreciated.
Since this looks like the result from youtube-dl, the "random" substring is most likely the unique video id, which in my experience is always 11 characters long. It can, however, include dashes (-), so the regex-approach suggested by smitrp would not always work.
I use this "dirty" workaround:
>>> original_name="AC_DC - Shot Down In Flames (Official Video)-UKwVvSleM6w.mp3"
>>> new_name=original_name[:-16]+".mp3"
>>> new_name
'AC_DC - Shot Down In Flames (Official Video).mp3'
Edit:
If you really, REALLY want to find the "-XXXX"-portion, have a look at str.rfind(). This will help you to find the index of the last dash (-), which you can directly use for the slice notation of the string.
Disclaimer:
This will provide wrong results, if the video id contains a dash, e.g. here: https://www.youtube.com/watch?v=7WVBEB8-wa0
Then you will find the last dash, remove -wa0 and be left with -7WVBEB8 at the end of the filename.
Using idea of the above answer, one can also take into account that a normal word does not
contain more than one capital character.
def youtube_name_fix(folder):
import os
from pathlib import Path
import re
REGEX = re.compile(r'[A-Z]')
for name in os.listdir(folder):
basename = Path(name)
last_12 = basename.stem[-12:]
# check if the end string is not all uppercase (then it could be part of a valid name)
if not last_12.isupper():
# check if the last string has more than one uppercase letters
if len(REGEX.findall(last_12)) > 1:
# remove the end youtube string and create new full path
new_name = os.path.join(folder, basename.stem[:-12] + basename.suffix)
try:
os.rename(os.path.join(folder,name), new_name)
except Exception as e:
print(e)
> youtube_name_fix(p)
old name -> "4-Discrete and Continuous Probability Models-esHwigpYggU.mp4"
new name -> "4-Discrete and Continuous Probability Models.mp4"
I'm running into a problem with the __ function, and I'm not sure whether it's a bug in Cake (3.2.8), Aura\Intl, or my code. I've tried the same thing in Cake 1.3, and it works as I expect it to, but it's possible that my expectations are simply that way because that's how it worked in 1.3. :-)
When I am building my menus, I use things like __('Teams'), but I also have pages that use things like __n('Team', 'Teams', count($player->teams)). The i18n shell extracts these into the default.pot separately, so when I translate it to French, it's like this:
msgid "Teams"
msgstr "Équipe"
msgid "Team"
msgid_plural "Teams"
msgstr[0] "Équipe"
msgstr[1] "Équipes"
If I call __('Team'), I correctly get 'Équipe' returned, and if I call __n('Team', 'Teams', $x), I correctly get 'Équipe' or 'Équipes' returned, depending on the value of $x. But if I call __('Teams') I get back
Array
(
[0] => Équipe
[1] => Équipes
)
This is the case even if I eliminate the msgid "Teams" section, leaving only the plural definition.
In Cake 1.3, __('Teams') would simply return 'Équipes'. (Don't know what it might do in 2.x, as I skipped over that entirely.) So, whose bug is this?
You have two Teams message IDs. The problem is that the CakePHP message file parser stores the messages in a key => value fashion, where the message ID is used as the key, resulting in the Teams messages for msgid_plural to override the Teams message from the preceding msgid.
https://github.com/cakephp/cakephp/blob/3.2.8/src/I18n/Parser/PoFileParser.php#L149
https://github.com/cakephp/cakephp/blob/3.2.8/src/I18n/Parser/PoFileParser.php#L172
https://github.com/cakephp/cakephp/blob/3.2.8/src/I18n/Parser/MoFileParser.php#L137-L140
Since gettext seems to be able to handle this, I'd say that it's at least a missing feature (which might be intentional), however it might even be a bug, I can't tell for sure (but I'd tend towards bug). For clarification, open an issue over at GitHub.
To (temporarily) workaround this issue you could use contexts.
Duplicate message definition
This is problematic:
msgid "Teams" <- same string
msgstr "Équipe"
msgid "Team"
msgid_plural "Teams" <- same string
The first means you have or are expecting to have __('Teams') in the application code, which expects to return a string.
The second scenario is going to create ambiguous data when the po file is parsed. The class responsible for converting po files to the array format is the PoFileParser, which contains these lines:
$messages[$singular] = $translation; // <- a string
...
$messages[$key] = $plurals; // <- an array
Where $messages is the array used to lookup translations, indexed by the translation key.
So, the reason for the observed behavior is because this code:
__('Teams');
is going to look for $messages['Teams']
This code:
__n('Team', 'Teams', 2);
is going to look for $messages['Teams'][<index>], and the $messages array is going to contain the parsed data from the plural translation only, which overwrote the "singular" version of the string as it was earlier in the file.
The code-level solution is to ensure that all msgid and msgid_plural keys are unique (since they are essentially, the same thing).
Bad translation definition
You may well find at some point in the future that translations like __('Teams') are very problematic, depending on how that loose word is being used, they are an indicator of bad translation definitions.
To give an example, CakePHP used to have translations of this form in baked output:
...
sprintf(__('Invalid %s', true), 'Team');
...
Which was later changed to:
__('Invalid Team', true)
Because the translation of Invalid %s can change depending on what %s is - so can the translation of %s.
I'm using Lemon as a parser generator, its error handling is the same as yacc's and bison's if you don't know Lemon.
Lemon has an option to define the error token in a set of rules in order to catch parsing errors. The default behavior of the generated parser is to destroy the token causing the error; is there any way to override this behavior so that I can keep the token?
Here's an example to show what's happening: basically I'm appending the tokens for each rule together to reform the input string, here's an example grammar:
input ::= string(A) { printf("%s", A); } // Print the result
string(A) ::= string(B) part(C). { A = append(B, C); }
string(A) ::= part(B). { A = B; }
part(A) ::= NUMBER(B) NAME(C). { A = append(C, B); } // Rearrange the number and name
part(A) ::= error(B). { A = B; } // On error keep the token anyways
On input:
"Username 1234Joseph"
I get output:
"Joseph1234"
Because the text "Username " is junked by the parser in the part(A) ::= error(B) rule, but I really want:
"Username Joseph1234"
as output.
If you can solve this problem in bison or another parser generator I would accept that as an answer :)
With yacc/bison, a parsing error drops the tool into error recovery mode, if possible. It will attempt to discard tokens on its way to a "clean" state.
I'm unable to find a reference for lemon, so I can't show some lemon code to fix this, but with yacc/bison, one would use the rules here.
Namely, you need to adjust your error rule to state that the parser is ok with yyerrok to prevent it from dropping tokens. Next, it will attempt to reread the "bad" token, so you need to clear it with yyclearin. Finally, since the rule attached to your error code contains the contents of your token, you will need to set up a function that adjusts your input stack, by taking the current token contents and creating a new (proper) token with the same contents.
As an example, if a grammar defined as MyOther MyOther saw MyTok MyOther:
stack
MyTok: "the text"
MyOther: "new text"
stack
MyOther: "the text"
MyOther: "new text"
To accomplish this, look into using yybackup. I'm unable to find an alternative method, though yybackup is frowned upon.
It's an old one, but why not...
The grammar must include spaces. At the moment the grammar only allows a sequence of NUMBER NAME tokens (without any space between the tokens).