I am trying to build a scanner for AWK source code using (F)lex. I have been able to identify AWK keywords, comments, string literals, and digits; however, I am stuck on how to write a regular expression to match variable names, since these are quite dynamic.
Could someone please help me develop a regular expression for matching AWK variables?
http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html provides a definition of the AWK language.
Variables must start with a letter but can be alphanumeric without regard to case. The only special character that can be used is an underscore ("_"). I apologize, I am not very experienced with regex, let alone regular expressions for flex.
Thank you for your help.
[a-zA-Z_][a-zA-Z_0-9]*
Alphabetic or underscore to start, followed by zero or more alphanumerics or underscore.
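If you want to sanity-check that pattern outside of flex, grep -E speaks a close-enough dialect of extended regular expressions; this is just an illustrative test harness, not part of a scanner:

```shell
# Try the identifier pattern against a few candidate names.
# -x anchors the match to the whole line, mimicking a complete-token match.
for name in total _tmp x2 2bad has-dash; do
  if printf '%s\n' "$name" | grep -Eqx '[a-zA-Z_][a-zA-Z_0-9]*'; then
    echo "$name: variable"
  else
    echo "$name: not a variable"
  fi
done
```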
Special cases will be fields, which are prefixed by $:
$0
$1
and also
$NF
$i
You'll have to decide how you're going to deal with those.
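One sketch of how you might classify them, again using grep -E as a stand-in for the flex matcher (in a real scanner the fields would get their own rule returning a distinct token):

```shell
# A field is '$' followed by either a field number ($0, $1) or a variable ($NF, $i).
field='\$([0-9]+|[a-zA-Z_][a-zA-Z_0-9]*)'
for tok in '$0' '$1' '$NF' '$i' 'NF' '$'; do
  if printf '%s\n' "$tok" | grep -Eqx "$field"; then
    echo "$tok: field"
  else
    echo "$tok: not a field"
  fi
done
```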
Related
I'd like to convert assembly code to C, but I have trouble converting the number formats. It's a bit similar to this:
C# regex for assembly style hex numbers, but my numbers end with an "H", like 00CH, FFH, etc.
The major problem is that the input strings look like:
-33H
RAM4END-AVERH-1-1
AVERH+10H+1
1
I'm thinking of something like a (?<prevStuff>)(?<hexa>)(?<nextStuff>) format, in which case I could simply leave prevStuff and nextStuff alone, and the hexa parts would be: 33, [no match], 10, [no match]
I'm kind of new here, sorry for any misunderstandings.
Thanks in advance!
What you might be trying to create is a lexical analyzer or lexer. It takes an input string and returns tokens found from a series of rules. You can read more about lexical analysis here.
This regular expression will match the first two numbers:
[+-]?[0-9a-fA-F]+[H]
The third input is not a single number but an expression. Using the following rules you can match all the tokens:
[_a-zA-Z][_a-zA-Z0-9]* → identifier
[+-]?[0-9a-fA-F]+H → hexadecimal number
[0-9]+ → decimal number
To convert the input to another language like C, you can create an Abstract Syntax Tree from the lexer output. A syntax tree represents the structure of the code as statements and expressions. Then the AST can be used to emit C code based on the statements found.
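As a sketch of those three rules in action, here's a toy tokenizer in awk. The rule order is an assumption: hexadecimal is tried before identifier, since something like FFH matches both patterns; many assemblers dodge this ambiguity by requiring hex constants to start with a digit, e.g. 0FFH.

```shell
echo 'AVERH+10H+1' | awk '{
  gsub(/[+-]/, " & ")        # put spaces around operators so fields split on them
  for (i = 1; i <= NF; i++) {
    t = $i
    if (t ~ /^[0-9a-fA-F]+H$/)               print t, "hexadecimal"
    else if (t ~ /^[_a-zA-Z][_a-zA-Z0-9]*$/) print t, "identifier"
    else if (t ~ /^[0-9]+$/)                 print t, "decimal"
    else                                     print t, "operator"
  }
}'
```

Note that AVERH is classified as an identifier, not hex, because V and R are not hex digits.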
So it appears you're looking for "items" that start with a sign (not including operator signs such as the - in a - b, but including those such as the second - in a - -b) or a digit, and end with an H.
So, once you've split up your expression into components, you should just be able to detect something like:
[+\-0-9][0-9a-fA-F]*H
and replace them with the equivalent C value.
For example, here's a shell script (it needs Perl for the regex part) to start with, as it shows the possibilities:
#!/bin/bash
list='-14H 14H 027H 42 -17 0cH -0dH VAR1 -VAR2H'
for t1 in $list; do
    for t2 in $list; do
        expr=$t1-$t2
        xlat=$(echo $expr | perl -pe 's/\b([0-9a-fA-F]+)H\b/0x$1/g')
        echo "ORIG: $expr XLAT: $xlat"
    done
done
By using the \b word boundary matcher, it doesn't have to worry about the distinction between operators and signs at all.
I am putting together the last pattern for my flex scanner for parsing AWK source code.
I cannot figure out how to match the regular expressions used in the AWK source code as seen below:
{if ($0 ~ /^\/\// ){ #Match for "//" (Comment)
or more simply:
else if ($0 ~ /^Department/){
where the AWK regular expression is encapsulated within "/ /".
All of the flex patterns I have tried so far match my entire input file. I have tried changing the precedence of the regex pattern but have had no luck. Help would be greatly appreciated!
regexing regexen must be a meme somewhere. Anyway, let's give it a try.
A gawk regex consists of:
/
any number of regex components
/
A regex component (simplified form -- Note 1) is one of the following:
any character other than /, [ or \
a \ followed by any single character (we won't get into linefeeds just now, though)
a character class (see below)
Up to here it's easy. Now for the fun part.
A character class is:
[ or [^ or [] or [^] (Note 2)
any number of character class components
]
A character class component is (theoretically, but see below for the gawk bug) one of the following:
any single character other than ] or \ (Note 3)
a \ followed by any single character
a named character class (like [:alpha:], see below)
a collation class
A named character class is: (Note 5)
[:
a valid class name, which afaik is always a sequence of alpha characters, but it's maybe safer not to make assumptions.
:]
A collation class is mostly unimplemented but partially parsed. You could probably ignore them, because it seems like gawk doesn't get them right yet (Note 4). But for what it's worth:
[.
some multicharacter collation character, like 'ij' in Dutch locale (I think).
.]
or an equivalence class:
[=
some character, or maybe also a multicharacter collation character
=]
An important point is that [/] does not terminate the regex. You don't need to write [\/]. (You don't need to do anything to implement that; I'm just mentioning it.)
Note 1:
Actually, the interpretation of \ and character classes, when we get to them, is a lot more complicated. I'm just describing enough of it for lexing. If you actually want to parse the regexen into their bits and pieces, it's a lot more irritating.
For example, you can specify an arbitrary octet with \ddd or \xHH (eg \203 or \x4F). However, we don't need to care, because nothing in the escape sequence is special, so for lexing purposes it doesn't matter; we'll get the right end of the lexeme. Similarly, I didn't bother describing character ranges and the peculiar rules for - inside a character class, nor do I worry about regex metacharacters (){}?*+. at all, since they don't enter into lexing. You do have to worry about [] because it can implicitly hide a / from terminating the regex. (I once wrote a regex parser which let you hide / inside parenthesized expressions, which I thought was cool; it cuts down a lot on the kilroy-was-here noise (\/), but nobody else seems to think this is a good idea.)
Note 2:
Although gawk does \ wrong inside character classes (see Note 3 below), it doesn't require that you use them, so you can still use Posix behaviour. Posix behaviour is that the ] does not terminate the character class if it is the first character in the character class, possibly following the negating ^. The easiest way to deal with this is to let character classes start with any of the four possible sequences, which is summarized as:
\[^?]?
Note 3:
gawk differs from Posix EREs (Extended Regular Expressions) in that it interprets \ inside a character class as an escape character. Posix mandates that \ loses its special meaning inside character classes. I find it annoying that gawk does this (as do many other regex libraries, which is equally annoying). It's particularly annoying that the gawk info manual says that Posix requires it to do this, when it actually requires the reverse. But that's just me. Anyway, in gawk:
/[\]/]/
is a regular expression which matches either ] or /. In Posix, stripping the enclosing /s out of the way, it would be a regular expression which matches a \ followed by a / followed by a ]. (Both gawk and Posix require that ] not be special when it's not being treated as a character class terminator.)
Note 4:
There's a bug in the version of gawk installed on my machine where the regex parser gets confused at the end of a collating class. So it thinks the regex is terminated by the second / in:
/[[.a.]/]/
although it gets this right:
/[[:alpha:]/]/
and, of course, putting the slash first always works:
/[/[:alpha:]]/
Note 5:
Character classes and collating classes and friends are a bit tricky to parse because they have two-character terminators. "Write a regex to recognize C /* */ comments" used to be a standard interview question, but I suppose it no longer is. Anyway, here's a solution (for [:...:], but just substitute : for the other punctuation if you want to):
[[]:([^:]|:*[^]:])*:+[]] // Yes, I know it's unreadable. Stare at it a while.
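Putting the pieces together (and ignoring collating classes), the overall shape can be sanity-checked with GNU grep's ERE engine. This is a sketch, not the flex pattern itself; the \^ escape and the lone ] rely on GNU extensions:

```shell
# One ERE with the same shape: '/', then (plain char | escape | bracket
# expression) repeated, then '/'. A '/' inside [...] does not end the match.
re='/([^/[\\]|\\.|\[\^?]?([^]\\]|\\.)*\])*/'
printf '%s\n' '/^\/\// ){ x' '/[/]/ && y' | grep -oE "$re"
```

The first line extracts /^\/\//, and the second extracts /[/]/, showing that the unescaped / inside the character class is not treated as a terminator.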
regex can also work without "/.../" (given as a string); see the example:
print all numbers starting with 7 from 1-100:
kent$ seq 100|awk '{if($0~"7[0-9]")print}'
70
71
72
73
74
75
76
77
78
79
kent$ awk --version
GNU Awk 3.1.6
I need to find out if a file or directory name contains any extension, in Unix, for a Bourne shell script.
The logic will be:
If there is a file extension
Remove the extension
And use the file name without the extension
This is my first question in SO so will be great to hear from someone.
The concept of an extension isn't as strictly well-defined as in traditional / toy DOS 8+3 filenames. If you want to find file names containing a dot where the dot is not the first character, try this.
case $filename in
[!.]*.*) filename=${filename%.*};;
esac
This will trim the extension (as per the above definition, starting from the last dot if there are several) from $filename if there is one, and otherwise do nothing.
If you will not be processing files whose names might start with a dot, the case is superfluous, as the assignment alone will also leave the value untouched if there isn't a dot; but with this belt-and-suspenders example, you can easily pick whichever approach you prefer, in case you need to extend it one way or another.
To also handle names that start with a dot (trimming the extension as long as the dot being trimmed isn't the first character), try the pattern ?*.* instead.
The case expression in pattern ) commands ;; esac syntax may look weird or scary, but it's quite versatile, and well worth learning.
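For example, with some made-up file names:

```shell
for filename in report.txt archive.tar.gz .profile noext; do
  orig=$filename
  case $filename in
    [!.]*.*) filename=${filename%.*};;   # trim from the last dot
  esac
  echo "$orig -> $filename"
done
```

The dotfile .profile and the extensionless noext pass through unchanged, while archive.tar.gz loses only its final .gz.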
I would use a shell-agnostic solution. Running the name through:
cut -d . -f 1
will give you everything up to the first dot ('-d .' sets the delimiter and '-f 1' selects the first field). You can play with the parameters (try '--complement' to reverse the selection) and get pretty much anything you want.
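For example (note the difference from the case approach above: cut keeps everything before the first dot, so multi-dot names lose more than just the extension, and a leading dot yields an empty first field):

```shell
echo 'archive.tar.gz' | cut -d . -f 1    # everything before the first dot
echo '.profile' | cut -d . -f 1          # a leading dot gives an empty first field
```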
I want to write a simple program in C equivalent to the regular expression:
s/<rr>(.*?)<\/rr>/<test>$1<\/test>/gi
Does anyone have examples?
It helps if you understand what the regex is supposed to do.
The pattern
The parentheses (...) indicate the beginning and end of a group. They also create a backreference to be used later.
The . is a metacharacter that matches any character.
The * repetition specifier can be used to match "zero-or-more times" of the preceding pattern.
The ? is used here to make the preceding quantifier "lazy" instead of "greedy."
The $1 is likely (depending on the language) a reference to the first capture group. In this case it would be everything matched by (.*?).
The /g modifier at the end is used to perform a global match (find all matches rather than stopping after the first match).
The /i modifier is used to make case-insensitive matches
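Before reimplementing it in C, it can help to see the pattern run in an engine that supports it natively; here it is in Perl, with a made-up input:

```shell
# /g rewrites both tag pairs, /i matches <RR> as well as <rr>, and the
# lazy (.*?) keeps the first match from swallowing text up to the last </rr>.
printf '%s\n' 'pre <RR>one</RR> mid <rr>two</rr> post' |
  perl -pe 's/<rr>(.*?)<\/rr>/<test>$1<\/test>/gi'
``` mid <test>two</test> post' ] || exit 1
</test>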
References
regular-expressions.info, Grouping, Dot, Repetition: *+?{…}
Wikipedia's Interpolation Definition
I am just learning flex/bison and I am writing my own shell with it. I am trying to figure out a good way to do variable interpolation. My initial approach was to have flex scan for things like ~ (my home directory) or $myVar, and then set yylval.string to what a lookup function returns. My problem is that this doesn't help me when the text appears within one token:
kbsh:/home/kbrandt% echo ~
/home/kbrandt
kbsh:/home/kbrandt% echo ~/foo
/home/kbrandt /foo
kbsh:/home/kbrandt%
The lex definition I have for variables:
\$[a-zA-Z/0-9_]+ {
yylval.string = return_value(&variables, yytext + 1); /* skip the leading '$' */
return WORD;
}
Then in my Grammar, I have things like:
chdir_command:
CD WORD { change_dir($2); }
;
Anyone know of a good way to handle this sort of thing? Am I going about this all wrong?
The way 'traditional' shells deal with things like variable substitution is difficult to handle with lex/yacc. What they do is more like macro expansion, where AFTER expanding a variable, they then re-tokenize the input, without expanding further variables. So for example, an input like "xx${$foo}" where 'foo' is defined as 'bar' and 'bar' is defined as '$y' will expand to 'xx$y' which will be treated as a single word (and $y will NOT be expanded).
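That "expand once, don't rescan" behavior can be observed in an ordinary POSIX shell (the variable names are made up):

```shell
y='should not appear'
foo='bar'
bar='$y'           # bar's value contains a literal dollar sign
eval "v=\$$foo"    # one level of expansion: v gets bar's value, the string '$y'
echo "xx$v"        # prints xx$y -- the $y is NOT expanded again
```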
You CAN deal with this in flex, but you need a lot of supporting code. You need to use flex's yy_buffer_state stuff to sometimes redirect the expanded text into a buffer that you'll then rescan from, and use start states carefully to control when variables can and can't be expanded.
It's probably easier to use a very simple lexer that returns tokens like ALPHA (one or more alphabetic chars), NUMERIC (one or more digits), or WHITESPACE (one or more spaces or tabs), and have the parser assemble them appropriately; you end up with rules like:
simple_command: wordlist NEWLINE ;
wordlist: word | wordlist WHITESPACE word ;
word: word_frag
| word word_frag { $$ = concat_string($1, $2); }
;
word_frag: single_quote_string
| double_quote_string
| variable
| ALPHA
| NUMERIC
...more options...
;
variable: '$' name { $$ = lookup($2); }
| '$' '{' word '}' { $$ = lookup($3); }
| '$' '{' word ':' ....
as you can see, this gets complex quite fast.
Looks generally OK
I'm not sure what return_value is doing; hopefully it will strdup(3) the variable name, because yytext is just a buffer.
If you are asking about the division of labor between lex and parse, I'm sure it's perfectly reasonable to push the macro processing and parameter substitution into the scanner and just have your grammar deal with WORDs, lists, commands, pipelines, redirections, etc. After all, it would be reasonable enough, albeit kind of out of style and possibly defeating the point of your exercise, to do everything with code.
I do think that making cd or chdir a terminal symbol and using that in a grammar production is...not the best design decision. Just because a command is a built-in doesn't mean it should appear as a rule. Go ahead and parse cd and chdir like any other command. Check for built-in semantics as an action, not a production.
After all, what if it's redefined as a shell procedure?