Flex Regular Expression to Identify AWK Regular Expression - c

I am putting together the last pattern for my flex scanner for parsing AWK source code.
I cannot figure out how to match the regular expressions used in the AWK source code as seen below:
{if ($0 ~ /^\/\// ){ #Match for "//" (Comment)
or more simply:
else if ($0 ~ /^Department/){
where the AWK regular expression is encapsulated within "/ /".
All of the Flex patterns I have tried so far match my entire input file. I have tried changing the precedence of the regex pattern but have had no luck. Help would be greatly appreciated!!

regexing regexen must be a meme somewhere. Anyway, let's give it a try.
A gawk regex consists of:
/
any number of regex components
/
A regex component (simplified form -- Note 1) is one of the following:
any character other than /, [ or \
a \ followed by any single character (we won't get into linefeeds just now, though)
a character class (see below)
Up to here it's easy. Now for the fun part.
A character class is:
[ or [^ or [] or [^] (Note 2)
any number of character class components
]
A character class component is (theoretically, but see below for the gawk bug) one of the following:
any single character other than ] or \ (Note 3)
a \ followed by any single character
a named character class (see below)
a collation class
A named character class is: (Note 5)
[:
a valid class name, which afaik is always a sequence of alpha characters, but it's maybe safer not to make assumptions.
:]
A collation class is mostly unimplemented but partially parsed. You could probably ignore them, because it seems like gawk doesn't get them right yet (Note 4). But for what it's worth:
[.
some multicharacter collation character, like 'ij' in Dutch locale (I think).
.]
or an equivalence class:
[=
some character, or maybe also a multicharacter collation character
=]
An important point is that [/] does not terminate the regex. You don't need to write [\/]. (You don't need to do anything to implement that; I'm just mentioning it.)
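The rules above translate fairly directly into a hand-rolled scanner. Here is a minimal C sketch (the function name ere_end and the error convention are mine; it implements only the simplified grammar described here, gawk-style escapes inside classes included):

```c
#include <stddef.h>

/* Given a string starting at the opening '/', return the index one past
   the closing '/', or -1 if the regex literal is unterminated. */
int ere_end(const char *s)
{
    if (s[0] != '/') return -1;
    size_t i = 1;
    while (s[i] && s[i] != '/') {
        if (s[i] == '\\') {                 /* \ quotes the next character */
            if (!s[i + 1]) return -1;
            i += 2;
        } else if (s[i] == '[') {           /* character class: '/' loses meaning */
            i++;
            if (s[i] == '^') i++;           /* optional negation */
            if (s[i] == ']') i++;           /* ']' first is a literal */
            while (s[i] && s[i] != ']') {
                if (s[i] == '\\' && s[i + 1]) {
                    i += 2;                 /* gawk-style escape in a class */
                } else if (s[i] == '[' &&
                           (s[i+1] == ':' || s[i+1] == '.' || s[i+1] == '=')) {
                    char t = s[i + 1];      /* [: :], [. .], [= =] */
                    i += 2;
                    while (s[i] && !(s[i] == t && s[i + 1] == ']')) i++;
                    if (s[i]) i += 2; else return -1;
                } else {
                    i++;
                }
            }
            if (!s[i]) return -1;
            i++;                            /* closing ']' */
        } else {
            i++;                            /* ordinary character */
        }
    }
    return s[i] == '/' ? (int)i + 1 : -1;
}
```

Note how `[/]` and `[[:alpha:]/]` fall out for free: once we're inside a class, `/` is just another character.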
Note 1:
Actually, the interpretation of \ and character classes, when we get to them, is a lot more complicated. I'm just describing enough of it for lexing. If you actually want to parse the regexen into their bits and pieces, it's a lot more irritating.
For example, you can specify an arbitrary octet with \ddd or \xHH (eg \203 or \x4F). However, we don't need to care, because nothing in the escape sequence is special, so for lexing purposes it doesn't matter; we'll get the right end of the lexeme. Similarly, I didn't bother describing character ranges and the peculiar rules for - inside a character class, nor do I worry about regex metacharacters (){}?*+. at all, since they don't enter into lexing. You do have to worry about [] because it can implicitly hide a / from terminating the regex. (I once wrote a regex parser which let you hide / inside parenthesized expressions, which I thought was cool -- it cuts down a lot on the kilroy-was-here noise (\/) -- but nobody else seems to think this is a good idea.)
Note 2:
Although gawk does \ wrong inside character classes (see Note 3 below), it doesn't require that you use them, so you can still use Posix behaviour. Posix behaviour is that the ] does not terminate the character class if it is the first character in the character class, possibly following the negating ^. The easiest way to deal with this is to let character classes start with any of the four possible sequences, which is summarized as:
\[^?]?
Note 3:
gawk differs from Posix ERE's (Extended Regular Expressions) in that it interprets \ inside a character class as an escape character. Posix mandates that \ loses its special meaning inside character classes. I find it annoying that gawk does this (and so do many other regex libraries, equally annoying.) It's particularly annoying that the gawk info manual says that Posix requires it to do this, when it actually requires the reverse. But that's just me. Anyway, in gawk:
/[\]/]/
is a regular expression which matches either ] or /. In Posix, stripping the enclosing /s out of the way, it would be a regular expression which matches a \ followed by a / followed by a ]. (Both gawk and Posix require that ] not be special when it's not being treated as a character class terminator.)
Note 4:
There's a bug in the version of gawk installed on my machine where the regex parser gets confused at the end of a collating class. So it thinks the regex is terminated by the second / in:
/[[.a.]/]/
although it gets this right:
/[[:alpha:]/]/
and, of course, putting the slash first always works:
/[/[:alpha:]]/
Note 5:
Character classes and collating classes and friends are a bit tricky to parse because they have two-character terminators. "Write a regex to recognize C /* */ comments" used to be a standard interview question, but I suppose it no longer is. Anyway, here's a solution (for [:...:], but just substitute : for the other punctuation if you want to):
[[]:([^:]|:*[^]:])*:+[]] // Yes, I know it's unreadable. Stare at it a while.

A regex can also work without "/.../", as a dynamic (string) regex. See the example:
print all numbers starting with 7 from 1-100:
kent$ seq 100|awk '{if($0~"7[0-9]")print}'
70
71
72
73
74
75
76
77
78
79
kent$ awk --version
GNU Awk 3.1.6


Regex inside split() method unintended side-effect [duplicate]

$.validator.addMethod('AZ09_', function (value) {
return /^[a-zA-Z0-9.-_]+$/.test(value);
}, 'Only letters, numbers, and _-. are allowed');
When I use something like test-123 it still triggers as if the hyphen is invalid. I tried \- and --
Escaping using \- should be fine, but you can also try putting it at the beginning or the end of the character class. This should work for you:
/^[a-zA-Z0-9._-]+$/
Escaping the hyphen using \- is the correct way.
I have verified that the expression /^[a-zA-Z0-9.\-_]+$/ does allow hyphens. You can also use the \w class to shorten it to /^[\w.\-]+$/.
(Putting the hyphen last in the expression actually causes it to not require escaping, as it then can't be part of a range, however you might still want to get into the habit of always escaping it.)
The \- maybe wasn't working because you passed the whole pattern from the server as a string. If that's the case, you should first escape the \ so the server-side program can handle it too.
In a server side string: \\-
On the client side: \-
In regex (covers): -
Or you can simply put it at the end of the [] brackets.
Generally with the hyphen (-) character in regex, it's important to note the difference between escaping (\-) and not escaping (-) it, because the hyphen, apart from being a character itself, is also used to specify a range in regex.
In the first case, with the escaped hyphen (\-), the regex will only match the hyphen itself, as in the example /^[+\-.]+$/.
In the second case, without escaping, as in /^[+-.]+$/, the hyphen sits between plus and dot, so it matches all characters with ASCII values between 43 (plus) and 46 (dot); that includes the comma (ASCII 44) as a side-effect.
\- should work to escape the - in the character range. Can you quote what you tested when it didn't seem to? Because it seems to work: http://jsbin.com/odita3
A more generic way of matching hyphens is by using the character class for hyphens and dashes ("\p{Pd}" without quotes). If you are dealing with text from various cultures and sources, you might find that there are more types of hyphens out there, not just one character. You can add that inside the [] expression.

add_history problem while trying to make a minishell [duplicate]

I'm using the readline library in C to create a bash-like prompt within bash. When I tried to make the prompt colorful with ANSI color escape sequences, the coloring works great, but the cursor spacing is messed up. The input wraps around too early, and the wrap-around is to the same line, so it starts overwriting the prompt. I thought I should escape the color sequences with \[ and \] like
readline("\[\e[1;31m$\e[0m\] ")
But that prints the square brackets, and if I escape the backslashes it prints those too. How do I escape the color codes so the cursor still works?
The way to tell readline that a character sequence in a prompt string doesn't actually move the cursor when output to the screen is to surround it with the markers RL_PROMPT_START_IGNORE (currently, this is the character literal '\001' in readline's C header file) and RL_PROMPT_END_IGNORE (currently '\002').
And as @Joachim and @Alter said, use '\033' instead of '\e' for portability.
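In C, that means bracketing each escape sequence with those marker bytes when building the prompt string. A minimal sketch (the macro names and helper are mine; the marker values '\001' and '\002' match readline's RL_PROMPT_START_IGNORE and RL_PROMPT_END_IGNORE):

```c
#include <stdio.h>

/* String forms of readline's ignore markers. */
#define P_IGN_START "\001"   /* RL_PROMPT_START_IGNORE */
#define P_IGN_END   "\002"   /* RL_PROMPT_END_IGNORE   */

/* Build a red "$ " prompt whose escape sequences readline won't count
   when computing the cursor position. */
const char *colored_prompt(void)
{
    return P_IGN_START "\033[1;31m" P_IGN_END
           "$"
           P_IGN_START "\033[0m" P_IGN_END " ";
}
```

You would then call it as char *line = readline(colored_prompt()); and the cursor math stays correct even when the input wraps.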
I found this question when looking to refine the GNU readline prompt in a bash script. As with readline in C code, \[ and \] aren't special here, but \001 and \002 will work when given literally via the special treatment bash affords quoted words of the form $'string'. I've been here before (and left unsatisfied, not knowing to combine the markers with $'…'), so I figured I'd leave my explanation here now that I have a solution.
Using the data provided here, I was able to conclude this result:
C1=$'\001\033[1;34m\002'  # blue  - from \e[1;34m
C0=$'\001\033[0;0m\002'   # reset - from \e[0;0m
while read -p "${C1}myshell>$C0 " -e command; do
    echo "you said: $command"
done
This gives a blue prompt that says myshell> and has a trailing space, without colors for the actual command. Neither hitting Up nor entering a command that wraps to the next line will be confused by the non-printing characters.
As explained in the accepted answer, \001 (Start of Heading) and \002 (Start of Text) are the RL_PROMPT_START_IGNORE and RL_PROMPT_END_IGNORE markers, which tell bash and readline not to count anything between them for the purpose of painting the terminal. (Also found here: \033 is more reliable than \e and since I'm now using octal codes anyway, I might as well use one more.)
There seems to be quite the dearth of documentation on this; the best I could find was in perl's documentation for Term::ReadLine::Gnu, which says:
PROMPT may include some escape sequences. Use RL_PROMPT_START_IGNORE to begin a sequence of non-printing characters, and RL_PROMPT_END_IGNORE to end the sequence.

RegEx for assembly number

I'd like to convert some assembly code to C, but I have trouble changing the number formats. It's a bit similar to this:
C# regex for assembly style hex numbers, but my numbers end with an "H", like: 00CH, FFH, etc.
The major problem is that the input strings are like:
-33H
RAM4END-AVERH-1-1
AVERH+10H+1
1
I'm thinking of something like a (?<prevStuff>)(?<hexa>)(?<nextStuff>) format, in which case I could simply leave the prevStuff and nextStuff alone, and the hexa matches would be: 33, [no match], 10, [no match]
I'm kind of new here, sorry for any misunderstandings.
Thanks in advance!
What you might be trying to create is a lexical analyzer or lexer. It takes an input string and returns tokens found from a series of rules. You can read more about lexical analysis here.
This regular expression will match the first two numbers:
[+-]?[0-9a-fA-F]+[H]
The third one is not a number but an expression. Using the following rules you can match all the tokens:
[_a-zA-Z][_a-zA-Z0-9]* → identifier
[+-]?[0-9a-fA-F]+H → hexadecimal number
[0-9]+ → decimal number
To convert the input to another language like C, you can create an Abstract Syntax Tree from the lexer output. A syntax tree represents the structure of the code as statements and expressions. Then the AST can be used to emit C code based on the statements found.
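As a sketch of the "hexadecimal number" rule in plain C (the function name and the success/failure convention are mine, not from the question):

```c
#include <ctype.h>
#include <string.h>

/* Parse an "...H"-suffixed hex token such as "FFH" or "-33H" into its value.
   Returns 1 on success, 0 if tok is not a well-formed H-number. */
int parse_h_number(const char *tok, long *out)
{
    const char *p = tok;
    int neg = 0;
    if (*p == '+' || *p == '-')
        neg = (*p++ == '-');            /* optional sign */
    size_t n = strlen(p);
    if (n < 2 || (p[n-1] != 'H' && p[n-1] != 'h'))
        return 0;                       /* must be digits followed by H */
    long v = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        if (!isxdigit((unsigned char)p[i]))
            return 0;                   /* e.g. identifiers like AVERH fail here */
        v = v * 16 + (isdigit((unsigned char)p[i])
                          ? p[i] - '0'
                          : tolower((unsigned char)p[i]) - 'a' + 10);
    }
    *out = neg ? -v : v;
    return 1;
}
```

Note that AVERH is correctly rejected (V is not a hex digit), which is exactly the identifier-vs-number ambiguity the question worries about.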
So it appears you're looking for "items" that start with a sign (not including operational signs such as the - in a - b, but including the second - in a - -b) or a digit, and end with an H.
So, once you've split up your expression into components, you should just be able to detect something like:
[+\-0-9][0-9a-fA-F]*H
and replace them with the equivalent C value.
For example, here's a good shell script (needs Perl for the regex stuff) to start with as it shows the possibilities:
#!/bin/bash
list='-14H 14H 027H 42 -17 0cH -0dH VAR1 -VAR2H'
for t1 in $list; do
    for t2 in $list; do
        expr=$t1-$t2
        xlat=$(echo $expr | perl -pne 's/\b([0-9a-fA-F]+)H\b/0x$1/g')
        echo "ORIG: $expr XLAT: $xlat"
    done
done
By using the \b word boundary matcher, it doesn't have to worry about the distinction between operators and signs at all.

equivalent for regexp /<rr>(.*?)</rr>/<test>$1</test>/gi

I want to write simple program in C equivalent to the regular expression:
/<rr>(.*?)<\/rr>/<test>$1<\/test>/gi.
Does anyone have examples?
It helps if you understand what the regex is supposed to do.
The pattern
The parentheses (...) indicate the beginning and end of a group. They also create a backreference to be used later.
The . is a metacharacter that matches any character.
The * repetition specifier can be used to match "zero-or-more times" of the preceding pattern.
The ? is used here to make the preceding quantifier "lazy" instead of "greedy."
The $1 is likely (depends on the language) a reference to the first capture group. In this case it would be everything matched by (.*?)
The /g modifier at the end is used to perform a global match (find all matches rather than stopping after the first match).
The /i modifier is used to make case-insensitive matches
References
regular-expressions.info, Grouping, Dot, Repetition: *+?{…}
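Putting it together, here is a hedged C sketch of the substitution itself, using plain string scanning (strstr) rather than a regex engine, since the pattern's lazy .*? is equivalent to "up to the nearest </rr>". The /i flag is dropped for simplicity (it would need strcasestr or manual case folding), and the helper name is mine:

```c
#include <string.h>

/* Replace every <rr>...</rr> with <test>...</test>, globally, preserving
   the captured text ($1). Writes the result into out, which is assumed
   large enough for this sketch. */
void rr_to_test(const char *in, char *out)
{
    const char *p = in;
    out[0] = '\0';
    for (;;) {
        const char *open  = strstr(p, "<rr>");
        const char *close = open ? strstr(open + 4, "</rr>") : NULL;
        if (!open || !close) {          /* no more matches: copy the tail */
            strcat(out, p);
            return;
        }
        strncat(out, p, (size_t)(open - p));              /* text before match */
        strcat(out, "<test>");
        strncat(out, open + 4, (size_t)(close - (open + 4))); /* the $1 group */
        strcat(out, "</test>");
        p = close + 5;                  /* continue after </rr> (the /g part) */
    }
}
```b<test>y</test>") == 0);
rr_to_test("no tags", buf);
assert(strcmp(buf, "no tags") == 0);
</test>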

How to do Variable Substitution with Flex/Lex and Yacc/Bison

Wikipedia's Interpolation Definition
I am just learning flex/bison and I am writing my own shell with it. I am trying to figure out a good way to do variable interpolation. My initial approach was to have flex scan for something like ~ for my home directory, or $myVar, and then set yylval.string to what is returned using a lookup function. My problem is that this doesn't help me when the variable appears as part of a larger word:
kbsh:/home/kbrandt% echo ~
/home/kbrandt
kbsh:/home/kbrandt% echo ~/foo
/home/kbrandt /foo
kbsh:/home/kbrandt%
The lex definition I have for variables:
\$[a-zA-Z/0-9_]+  {
    yylval.string = return_value(&variables, yytext + 1);
    return WORD;
}
Then in my Grammar, I have things like:
chdir_command:
CD WORD { change_dir($2); }
;
Anyone know of a good way to handle this sort of thing? Am I going about this all wrong?
The way 'traditional' shells deal with things like variable substitution is difficult to handle with lex/yacc. What they do is more like macro expansion, where AFTER expanding a variable, they then re-tokenize the input, without expanding further variables. So for example, an input like "xx${$foo}" where 'foo' is defined as 'bar' and 'bar' is defined as '$y' will expand to 'xx$y' which will be treated as a single word (and $y will NOT be expanded).
You CAN deal with this in flex, but you need a lot of supporting code. You need to use flex's yy_buffer_state stuff to sometimes redirect the output into a buffer that you'll then rescan from, and use start states carefully to control when variables can and can't be expanded.
It's probably easier to use a very simple lexer that returns tokens like ALPHA (one or more alphabetic chars), NUMERIC (one or more digits), or WHITESPACE (one or more spaces or tabs), and have the parser assemble them appropriately; you end up with rules like:
simple_command: wordlist NEWLINE ;
wordlist: word | wordlist WHITESPACE word ;
word: word_frag
| word word_frag { $$ = concat_string($1, $2); }
;
word_frag: single_quote_string
| double_quote_string
| variable
| ALPHA
| NUMERIC
...more options...
;
variable: '$' name { $$ = lookup($2); }
| '$' '{' word '}' { $$ = lookup($3); }
| '$' '{' word ':' ....
as you can see, this gets complex quite fast.
Looks generally OK
I'm not sure what return_value is doing, hopefully it will strdup(3) the variable name, because yytext is just a buffer.
If you are asking about the division of labor between lex and parse, I'm sure it's perfectly reasonable to push the macro processing and parameter substitution into the scanner and just have your grammar deal with WORDs, lists, commands, pipelines, redirections, etc. After all, it would be reasonable enough, albeit kind of out of style and possibly defeating the point of your exercise, to do everything with code.
I do think that making cd or chdir a terminal symbol and using that in a grammar production is...not the best design decision. Just because a command is a built-in doesn't mean it should appear as a rule. Go ahead and parse cd and chdir like any other command. Check for built-in semantics as an action, not a production.
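For instance, the command rule can stay generic and the dispatch can live in the action; a hypothetical yacc fragment (the helper names builtin_cd and run_external are mine):

```yacc
simple_command: WORD arg_list NEWLINE {
        if (strcmp($1, "cd") == 0 || strcmp($1, "chdir") == 0)
            builtin_cd($2);          /* built-in semantics decided here */
        else
            run_external($1, $2);    /* everything else: fork and exec */
    }
    ;
```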
After all, what if it's redefined as a shell procedure?
