RegEx for assembly number - c

I'd like to convert some assembly code to C, but I have trouble changing the number formats. It's a bit similar to this:
C# regex for assembly style hex numbers, but my numbers end with an "H", like 00CH, FFH, etc.
The major problem is that the input strings look like:
-33H
RAM4END-AVERH-1-1
AVERH+10H+1
1
I'm thinking of something like a (?<prevStuff>)(?<hexa>)(?<nextStuff>) format, in which case I could simply leave prevStuff and nextStuff alone, and hexa would be: 33, [no match], 10, [no match]
I'm kind of new here, sorry for any misunderstandings.
Thanks in advance!

What you might be trying to create is a lexical analyzer or lexer. It takes an input string and returns tokens found from a series of rules. You can read more about lexical analysis here.
This regular expression will match the first two numbers:
[+-]?[0-9a-fA-F]+[H]
The third one is not a number but an expression. Using the following rules you can match all the tokens:
[_a-zA-Z][_a-zA-Z0-9]* → identifier
[+-]?[0-9a-fA-F]+H → hexadecimal number
[0-9]+ → decimal number
To convert the input to another language like C, you can create an Abstract Syntax Tree from the lexer output. A syntax tree represents the structure of the code as statements and expressions. Then the AST can be used to emit C code based on the statements found.
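The token rules above can be sketched as a small tokenizer. This is only an illustration in Python, not a full lexer: the names tokenize and to_c are hypothetical, the sign is treated as a separate operator token rather than part of the number, and the hexadecimal rule is tried before the identifier rule, which means a name made only of hex digits plus H (such as FFH) is assumed to be a number.

```python
import re

# One alternative per token rule; order matters (hex is tried first).
TOKEN_RE = re.compile(r"""
    (?P<hex>[0-9a-fA-F]+H\b)             # hexadecimal number with H suffix
  | (?P<ident>[_a-zA-Z][_a-zA-Z0-9]*)    # identifier
  | (?P<dec>[0-9]+)                      # decimal number
  | (?P<op>[+\-*/()])                    # operators and parentheses
  | (?P<ws>\s+)                          # whitespace (skipped)
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, value) pairs for an assembly expression."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def to_c(text):
    """Rewrite H-suffixed hex numbers as C 0x literals, keep the rest."""
    parts = []
    for kind, value in tokenize(text):
        parts.append("0x" + value[:-1] if kind == "hex" else value)
    return "".join(parts)


for sample in ["-33H", "RAM4END-AVERH-1-1", "AVERH+10H+1", "1"]:
    print(sample, "->", to_c(sample))
```

Working on tokens rather than on the raw string keeps AVERH intact, because the identifier rule consumes the whole name before the hex rule can see its trailing H.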

So it appears you're looking for "items" that start with a sign (not including operational signs such as a - b but including those such as the second - in a - -b) or a digit and end with a H.
So, once you've split up your expression into components, you should just be able to detect something like:
[+\-0-9][0-9a-fA-F]*H
and replace them with the equivalent C value.
For example, here's a good shell script (needs Perl for the regex stuff) to start with as it shows the possibilities:
#!/bin/bash
list='-14H 14H 027H 42 -17 0cH -0dH VAR1 -VAR2H'
for t1 in $list; do
for t2 in $list; do
expr=$t1-$t2
xlat=$(echo $expr | perl -pne 's/\b([0-9a-fA-F][0-9a-fA-F]*)H\b/0x\1/g')
echo ORIG: $expr XLAT: $xlat
done
done
By using the \b word boundary matcher, it doesn't have to worry about the distinction between operators and signs at all.
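The same word-boundary substitution can be tried outside the shell. The following is a minimal Python sketch of it; the function name hex_to_c is made up for illustration. Note the limitation it shares with the Perl one-liner: a symbol consisting only of hex digits followed by H (for example ADDH) would also be rewritten.

```python
import re

# \b on both sides ensures the digits+H run is a whole word, so AVERH or
# VAR2H (which contain non-hex letters) are left alone.
HEX_SUFFIX = re.compile(r"\b([0-9a-fA-F]+)H\b")


def hex_to_c(expr):
    """Replace every H-suffixed hex number in expr with a 0x literal."""
    return HEX_SUFFIX.sub(r"0x\1", expr)


for expr in ["-14H-027H", "0cH--0dH", "VAR1--VAR2H", "AVERH+10H+1"]:
    print("ORIG:", expr, "XLAT:", hex_to_c(expr))
```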

Related

Split array element delimited with '.'

I am trying to read below CSV file content line by line in Perl.
CSV File Content:
A7777777.A777777777.XXX3604,XXX,3604,YES,9
B9694396.B216905785.YYY0018,YYY,0018,YES,13
C9694396.C216905785.ZZZ0028,ZZZ,0028,YES,16
I am able to split line content using below code and able to verify the content too:
@column_fields1 = split(',', $_);
print $column_fields1[0],"\n";
I am also trying to find the second part on the first column of CSV file (i.e., A777777777 or B216905785 or C216905785) – the first column delimited with . using the below code and I am unable to get it.
Instead, just a new line printed.
my ($v1, $v2, $v3) = split(".", $column_fields1[0]);
print $v2,"\n";
Can someone suggest how to split the array element and get the above value?
In my program, I need the whole first-column value in one place and only the second part in another.
Below is my code:
use strict;
use warnings;
my $dailybillable_tab_section1_file = "./sql/demanding_01_T.csv";
open(FILE, $dailybillable_tab_section1_file) or die "Could not read from $dailybillable_tab_section1_file, program halting.";
my @column_fields1;
my @column_fields2;
while (<FILE>)
{
chomp;
@column_fields1 = split(',', $_);
print $column_fields1[0],"\n";
my ($v1, $v2, $v3) = split(".",$column_fields1[0]);
print $v2,"\n";
if($v2 ne 'A777777777')
{
…
…
…
}
else
{
…
…
…
}
}
close FILE;
split takes a regex as its first argument. You can pass it a string (as in your code), but the contents of the string will simply be interpreted as a regex at runtime.
That's not a problem for , (which has no special meaning in a regex), but it breaks with . (which matches any (non-newline) character in a regex).
Your attempt to fix the problem with split "\." fails because "\." is identical to ".": The backslash has its normal string escape meaning, but since . isn't special in strings, escaping it has no effect. You can see this by just printing the resulting string:
print "\.\n"; # outputs '.', same as print ".\n";
That . is then interpreted as a regex, causing the problems you have observed.
The normal fix is to just pass a regex to split:
split /\./, $string
Now the backslash is interpreted as part of the regex, forcing . to match itself literally.
If you really wanted to pass a string to split (I'm not sure why you'd want to do that), you could also do it like this:
split "\\.", $string
The first backslash escapes the second backslash, giving a two character string (\.), which when interpreted as a regex means the same thing as /\./.
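The effect of the unescaped dot is easy to reproduce in any regex engine. Here is a small sketch using Python's re.split rather than Perl's split, so the details differ slightly: Perl additionally drops trailing empty fields, which is why the original code ended up with nothing at all in $v2, whereas Python keeps the empty strings.

```python
import re

line = "A7777777.A777777777.XXX3604,XXX,3604,YES,9"
first_column = line.split(",")[0]

# Unescaped: "." matches every single character, so every character is
# treated as a separator and only empty fields are left.
unescaped = re.split(".", first_column)
print(unescaped[:3])

# Escaped: "\." matches a literal dot only, giving the three parts.
v1, v2, v3 = re.split(r"\.", first_column)
print(v2)
```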
If you look at the documentation for split(), you'll see it gives the following ways to call the function:
split /PATTERN/,EXPR,LIMIT
split /PATTERN/,EXPR
split /PATTERN/
split
In three of those examples, the first argument to the function is /PATTERN/. That is, split() expects to be given a regular expression which defines how the input string is split apart.
It's very important to realise that this argument is a regex, not a string. Unfortunately, Perl's parser doesn't insist on that. It allows you to use a first argument which looks like a string (as you have done). But no matter how it looks, it's not a string. It's a regex.
So you have confused yourself by using code like this:
split(".",$COLUMN_FIELDS1[0])
If you had made the first argument look like a regex, then you would be more likely to realise that the first argument is a regex and that, therefore, a dot needs to be escaped to prevent it being interpreted as a metacharacter.
split(/\./, $COLUMN_FIELDS1[0])
Update: It's generally accepted among Perl programmers that variables with upper-case names are constants and don't change their values. By using upper-case names for standard variables, you are likely to confuse the next person who edits your code (who could well be you in six months' time).

bash array with square bracket strings

I want to make an array of string values that contain square brackets, but I keep getting unexpected output.
selections=()
for i in $choices
do
selections+=("role[${filenames[$i]}]")
done
echo ${selections[@]}
If choices were 1 and 2, and filenames[1] and filenames[2] held the values 'A' and 'B', I want the selections array to hold the strings role[A] and role[B].
Instead, the output I get is just roles.
I can make the code you presented produce the output you wanted, or not, depending on the values I assign to variables filenames and choices.
First, I observe that bash indexed arrays are indexed starting at 0, not 1. If you are using the values 1 and 2 as indices into array filenames, and if that is an indexed array with only two elements, then it may be that ${filenames[2]} expands to nothing. This would be the result if you initialize filenames like so:
# NOT WHAT YOU WANT:
filenames=(A B)
Instead, either assign array elements individually, or add a dummy value at index 0:
# Could work:
filenames=('' A B)
Next, I'm suspicious of choices. Since you're playing with arrays, I speculate that you may have initialized choices as an array, like so:
# NOT CONSISTENT WITH YOUR LATER USAGE:
choices=(1 2)
If you expand an array-valued variable without specifying an index, it is as if you specified index 0. With the above initialization, then, $choices would expand to just 1, not 1 2 as you intend. There are two possibilities: either initialize choices as a flat string:
# Could work:
choices='1 2'
or expand it differently:
# or expand it this way:
for i in "${choices[@]}"
Do not overlook the quotes, by the way: that particular form will expand to one word per array element, but without the quotes the array elements would be subject to word splitting and other expansions (though that's moot for the particular values you're using in this case).
The quoting applies also, in general, to your echo command: if you do not quote the expansion then you have to analyze the code much more carefully to be confident that it will do what you intend in all cases. It will be subject not only to word splitting, but pathname expansion and a few others. In your case, there is a potential for pathname expansion to be performed, depending on the names of the files in the working directory (thanks @CharlesDuffy). It is far safer to just quote.
Anyway, here is a complete demonstration incorporating your code verbatim and producing the output you want:
#!/bin/bash
filenames=('' 'A' 'B')
choices="1 2"
selections=()
for i in $choices
do
selections+=("role[${filenames[$i]}]")
done
echo ${selections[@]}
# better:
# echo "${selections[@]}"
Output:
role[A] role[B]
Finally, as I observed in comments, there is no way that your code could output "roles", as you claim it does, given the inputs (variable values) you claim it has. If that's in fact what you see, then either it is not related to the code you presented at all, or your inputs are different than you claim.

Flex Regular Expression to Identify AWK Regular Expression

I am putting together the last pattern for my flex scanner for parsing AWK source code.
I cannot figure out how to match the regular expressions used in the AWK source code as seen below:
{if ($0 ~ /^\/\// ){ #Match for "//" (Comment)
or more simply:
else if ($0 ~ /^Department/){
where the AWK regular expression is encapsulated within "/ /".
All of the Flex patterns I have tried so far match my entire input file. I have tried changing the precedence of the regex pattern and have found no luck. Help would be greatly appreciated!!
regexing regexen must be a meme somewhere. Anyway, let's give it a try.
A gawk regex consists of:
/
any number of regex components
/
A regex component (simplified form -- Note 1) is one of the following:
any character other than /, [ or \
a \ followed by any single character (we won't get into linefeeds just now, though).
a character class (see below)
Up to here it's easy. Now for the fun part.
A character class is:
[ or [^ or [] or [^] (Note 2)
any number of character class components
]
A character class component is (theoretically, but see below for the gawk bug) one of the following:
any single character other than ] or \ (Note 3)
a \ followed by any single character
a named character class
a collation class
A named character class is: (Note 5)
[:
a valid class name, which afaik is always a sequence of alpha characters, but it's maybe safer not to make assumptions.
:]
A collation class is mostly unimplemented but partially parsed. You could probably ignore them, because it seems like gawk doesn't get them right yet (Note 4). But for what it's worth:
[.
some multicharacter collation character, like 'ij' in Dutch locale (I think).
.]
or an equivalence class:
[=
some character, or maybe also a multicharacter collation character
=]
An important point is that [/] does not terminate the regex. You don't need to write [\/]. (You don't need to do anything to implement that. I'm just mentioning it.)
Note 1:
Actually, the interpretation of \ and character classes, when we get to them, is a lot more complicated. I'm just describing enough of it for lexing. If you actually want to parse the regexen into their bits and pieces, it's a lot more irritating.
For example, you can specify an arbitrary octet with \ddd or \xHH (eg \203 or \x4F). However, we don't need to care, because nothing in the escape sequence is special, so for lexing purposes it doesn't matter; we'll get the right end of the lexeme. Similarly, I didn't bother describing character ranges and the peculiar rules for - inside a character class, nor do I worry about regex metacharacters (){}?*+. at all, since they don't enter into lexing. You do have to worry about [] because it can implicitly hide a / from terminating the regex. (I once wrote a regex parser which let you hide / inside parenthesized expressions, which I thought was cool -- it cuts down a lot on the kilroy-was-here noise (\/) -- but nobody else seems to think this is a good idea.)
Note 2:
Although gawk does \ wrong inside character classes (see Note 3 below), it doesn't require that you use them, so you can still use Posix behaviour. Posix behaviour is that the ] does not terminate the character class if it is the first character in the character class, possibly following the negating ^. The easiest way to deal with this is to let character classes start with any of the four possible sequences, which is summarized as:
\[^?]?
Note 3:
gawk differs from Posix ERE's (Extended Regular Expressions) in that it interprets \ inside a character class as an escape character. Posix mandates that \ loses its special meaning inside character classes. I find it annoying that gawk does this (and so do many other regex libraries, equally annoying.) It's particularly annoying that the gawk info manual says that Posix requires it to do this, when it actually requires the reverse. But that's just me. Anyway, in gawk:
/[\]/]/
is a regular expression which matches either ] or /. In Posix, stripping the enclosing /s out of the way, it would be a regular expression which matches a \ followed by a / followed by a ]. (Both gawk and Posix require that ] not be special when it's not being treated as a character class terminator.)
Note 4:
There's a bug in the version of gawk installed on my machine where the regex parser gets confused at the end of a collating class. So it thinks the regex is terminated by the second / in:
/[[.a.]/]/
although it gets this right:
/[[:alpha:]/]/
and, of course, putting the slash first always works:
/[/[:alpha:]]/
Note 5:
Character classes and collating classes and friends are a bit tricky to parse because they have two-character terminators. "Write a regex to recognize C /* */ comments" used to be a standard interview question, but I suppose it no longer is. Anyway, here's a solution (for [:...:], but just substitute : for the other punctuation if you want to):
[[]:([^:]|:*[^]:])*:+[]] // Yes, I know it's unreadable. Stare at it a while.
A regex also works without "/.../", given as a string instead. See the example:
print all numbers starting with 7 from 1-100:
kent$ seq 100|awk '{if($0~"7[0-9]")print}'
70
71
72
73
74
75
76
77
78
79
kent$ awk --version
GNU Awk 3.1.6

MD5 implementation in C for a XML file

I need to implement an MD5 checksum to verify the MD5 checksum in an XML file that we received from our client; the checksum covers the whole file, including all XML tags. The received MD5 checksum is 32 hexadecimal digits long.
We need to set the MD5 checksum field to 0 in the received XML file prior to the checksum calculation, and then independently calculate and verify the MD5 checksum value of the received XML file.
Our application is implemented in C. Please assist me on how to implement this.
Thanks
This depends directly on the library used for XML parsing. It is tricky, however, because you can't simply embed the MD5 in the XML file itself: embedding the checksum changes the file and therefore its checksum, unless you compute the checksum only from specific elements. As I understand it, you receive the MD5 independently? Is it calculated from the whole file, or only the tags/content?
MD5 Public Domain code link - http://www.fourmilab.ch/md5/
XML library for C - http://xmlsoft.org/
Exact solutions depend on the code used.
Based on your comment you need to do the following steps:
load the XML file (possibly even as plain text) and read the MD5
substitute the MD5 in the file with zero, and write the file out (or better, keep it in memory)
run MD5 on the resulting file data and compare it with the value stored before
There are public-domain implementations of MD5 that you should use, instead of writing your own. I hear that Colin Plumb's version is widely used.
Don't reinvent the wheel, use a proven existing solution: http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
Incidentally that was the first link that came up when I googled "md5 c implementation".
This is rather nasty. The approach suggested seems to imply you need to parse the XML document into something like a DOM tree, find the MD5 checksum and store it for future reference. Then you would replace the checksum with 0 before re-serializing the document and calculating its MD5 hash. This all sounds doable but potentially tricky. The major difficulty I see is that your new serialization of the document may not be the same as the original one, and irrelevant (to XML) differences like the use of single or double quotes around attribute values, added line breaks or even a different encoding will cause the hashes to differ. If you go down this route you'll need to make sure your app and the procedure used to create the document in the first place make the same choices. For this sort of problem canonical XML is the standard solution (http://www.w3.org/TR/xml-c14n).
However, I would do something different. With any luck it should be quite easy to write a regular expression to locate the MD5 hash in the file and replace it with 0. You can then use this to grab the hash and replace it with 0 in the XML file before recalculating the hash. This sidesteps all the possible issues with parsing, changing and re-serializing the XML document. To illustrate I'm going to assume the hash '33d4046bea07e89134aecfcaf7e73015' lives in the XML file like this:
<docRoot xmlns='some-irrelevant-uri'>
<myData>Blar blar</myData>
<myExtraData number='1'/>
<docHash MD5="33d4046bea07e89134aecfcaf7e73015" />
<evenMoreOfMyData number='34'/>
</docRoot>
(which I've called hash.xml), that the MD5 should be replaced by 32 zeros (so the hash is correct) and illustrate the procedure on a shell command line using perl, md5 and bash. (Hopefully translating this into C won't be too hard given the existence of regular expression and hashing libraries.)
Breaking down the problem, you first need to be able to find the hash that is in the file:
perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml
(this works by looking for the start of the MD5 attribute of the docHash element, allowing for possible other attributes, and then grabbing the next 32 hex characters. If it finds them it bungs them in the magic $_ variable, if not it sets $_ to be empty, then the value of $_ gets printed for each line. This results in the string "33d4046bea07e89134aecfcaf7e73015" being printed.)
Then you need to calculate the hash of the file with the hash replaced with zeros:
perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5
(where the regular expression is almost the same, but this time the hex characters are replaced by 32 zeros and the whole file is printed. Then the MD5 of this is calculated by piping the result through an md5 hashing program.) Putting this together with a bit of bash gives:
if [ `perl -p -e'if (m#<docHash.+MD5="([a-fA-F0-9]{32})#) {$_ = "$1\n"} else {$_ = ""}' hash.xml` = `perl -p -e's#(<docHash.+MD5=)"([a-fA-F0-9]{32})#$1"00000000000000000000000000000000#' hash.xml | md5` ] ; then echo OK; else echo ERROR; fi
which executes those two small commands, compares the output and prints "OK" if the outputs match or "ERROR" if they don't. Obviously this is just a simple prototype, and it is in the wrong language, but I think it illustrates the most straightforward solution.
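For comparison, the same prototype can be written as one small function. This is a sketch in Python rather than C, using only the standard hashlib and re modules; the element name docHash and the attribute name MD5 are the illustrative assumptions from the example above, not something taken from the real file format, and verify_embedded_md5 is a made-up name.

```python
import hashlib
import re

# Capture the text up to the opening quote, then the 32 hex digits.
HASH_RE = re.compile(rb'(<docHash[^>]*?MD5=["\'])([0-9a-fA-F]{32})')


def verify_embedded_md5(data: bytes) -> bool:
    """Check the MD5 stored inside data against data with the hash zeroed."""
    m = HASH_RE.search(data)
    if m is None:
        return False  # no embedded hash found
    claimed = m.group(2).decode("ascii").lower()
    # Replace only the 32 hex digits; every other byte stays untouched,
    # so no re-serialization differences can creep in.
    zeroed = data[:m.start(2)] + b"0" * 32 + data[m.end(2):]
    return hashlib.md5(zeroed).hexdigest() == claimed


# Build a self-consistent sample document to exercise the function.
template = b'<docRoot><myData>Blar blar</myData><docHash MD5="%s" /></docRoot>'
digest = hashlib.md5(template % (b"0" * 32)).hexdigest().encode("ascii")
sample = template % digest
print(verify_embedded_md5(sample))
```

Operating on the raw bytes mirrors the regular-expression approach: the file is never parsed as XML, so quoting style, line breaks and encoding are preserved exactly as received.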
Incidentally, why do you put the hash inside the XML document? As far as I can see it doesn't have any advantage compared to passing the hash along on a side channel (even something as simple as in a second file called documentname.md5) and makes the hash validation more difficult.
Check out these examples for how to use the XMLDSIG standard with .net
How to: Sign XML Documents with Digital Signatures
How to: Verify the Digital Signatures of XML Documents
You should maybe consider changing the setting for preserving whitespace.

How to do Variable Substitution with Flex/Lex and Yacc/Bison

Wikipedia's Interpolation Definition
I am just learning flex / bison and I am writing my own shell with it. I am trying to figure out a good way to do variable interpolation. My initial approach was to have flex scan for something like ~ for my home directory, or $myVar, and then set yylval.string to what is returned by a lookup function. My problem is that this doesn't help me when the text appears within one token:
kbsh:/home/kbrandt% echo ~
/home/kbrandt
kbsh:/home/kbrandt% echo ~/foo
/home/kbrandt /foo
kbsh:/home/kbrandt%
The lex definition I have for variables:
\$[a-zA-Z/0-9_]+ {
yylval.string=return_value(&variables, (yytext + sizeof(char)));;
return(WORD);
}
Then in my Grammar, I have things like:
chdir_command:
CD WORD { change_dir($2); }
;
Anyone know of a good way to handle this sort of thing? Am I going about this all wrong?
The way 'traditional' shells deal with things like variable substitution is difficult to handle with lex/yacc. What they do is more like macro expansion, where AFTER expanding a variable, they then re-tokenize the input, without expanding further variables. So for example, an input like "xx${$foo}" where 'foo' is defined as 'bar' and 'bar' is defined as '$y' will expand to 'xx$y' which will be treated as a single word (and $y will NOT be expanded).
You CAN deal with this in flex, but you need a lot of supporting code. You need to use flex's yy_buffer_state stuff to sometimes redirect the output into a buffer that you'll then rescan from, and use start states carefully to control when variables can and can't be expanded.
It's probably easier to use a very simple lexer that returns tokens like ALPHA (one or more alphabetic chars), NUMERIC (one or more digits), or WHITESPACE (one or more space or tab), and have the parser assemble them appropriately, and you end up with rules like:
simple_command: wordlist NEWLINE ;
wordlist: word | wordlist WHITESPACE word ;
word: word_frag
| word word_frag { $$ = concat_string($1, $2); }
;
word_frag: single_quote_string
| double_quote_string
| variable
| ALPHA
| NUMERIC
...more options...
;
variable: '$' name { $$ = lookup($2); }
| '$' '{' word '}' { $$ = lookup($3); }
| '$' '{' word ':' ....
As you can see, this gets complex quite fast.
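The idea behind the word and word_frag rules, namely expanding each fragment and concatenating the results into one word, can be tried out in isolation. The following is a simplified Python sketch, not flex/bison code: expand_word, FRAG_RE and the variables dictionary are hypothetical names, quoting is ignored, ~ is only expanded at the start of a word, and unset variables are assumed to expand to an empty string.

```python
import re

# Fragments of a word: ${name}, $name, a leading ~, or literal text.
FRAG_RE = re.compile(r"\$\{(\w+)\}|\$(\w+)|(^~)|([^$]+|\$)")


def expand_word(word, variables, home):
    """Expand every fragment and join the pieces into a single word."""
    pieces = []
    for m in FRAG_RE.finditer(word):
        braced, bare, tilde, literal = m.groups()
        if braced or bare:
            # Expanded text is appended as-is and never rescanned.
            pieces.append(variables.get(braced or bare, ""))
        elif tilde:
            pieces.append(home)
        else:
            pieces.append(literal)
    return "".join(pieces)


print(expand_word("~/foo", {}, "/home/kbrandt"))
print(expand_word("$dir/bin", {"dir": "/usr/local"}, "/home/kbrandt"))
```

Since the pieces are joined before the word is handed on, ~/foo stays a single word instead of splitting into two, which is the symptom shown in the question.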
Looks generally OK
I'm not sure what return_value is doing, hopefully it will strdup(3) the variable name, because yytext is just a buffer.
If you are asking about the division of labor between lex and parse, I'm sure it's perfectly reasonable to push the macro processing and parameter substitution into the scanner and just have your grammar deal with WORDs, lists, commands, pipelines, redirections, etc. After all, it would be reasonable enough, albeit kind of out of style and possibly defeating the point of your exercise, to do everything with code.
I do think that making cd or chdir a terminal symbol and using that in a grammar production is...not the best design decision. Just because a command is a built-in doesn't mean it should appear as a rule. Go ahead and parse cd and chdir like any other command. Check for built-in semantics as an action, not a production.
After all, what if it's redefined as a shell procedure?

Resources