equivalent for regexp /<rr>(.*?)</rr>/<test>$1</test>/gi - c

I want to write a simple program in C equivalent to the regular expression:
/<rr>(.*?)<\/rr>/<test>$1<\/test>/gi.
Does anyone have examples?

It helps if you understand what the regex is supposed to do.
The pattern
The parentheses (...) indicate the beginning and end of a group. They also create a backreference to be used later.
The . is a metacharacter that matches any character.
The * repetition specifier matches the preceding pattern zero or more times.
The ? is used here to make the preceding quantifier "lazy" instead of "greedy."
The $1 is likely (depends on the language) a reference to the first capture group. In this case it would be everything matched by (.*?)
The /g modifier at the end is used to perform a global match (find all matches rather than stopping after the first match).
The /i modifier makes the match case-insensitive.
References
regular-expressions.info, Grouping, Dot, Repetition: *+?{…}

Related

What does the letter c do in Snowflake?

While using Snowflake, I saw the letter c within a WHERE clause
WHERE unnest_channel IN ('Referral - Merchant','Referral - Whitelabel')c
What does it do? (Some kind of casting?)
For the regular expression functions, the default parameters string is simply c, which specifies:
Case-sensitive matching.
Single-line mode.
No sub-match extraction, except for REGEXP_REPLACE, which always uses sub-match extraction.
POSIX wildcard character . does not match \n newline characters.
When specifying multiple parameters, the string is entered with no spaces or delimiters. For example, ims specifies case-insensitive matching in multi-line mode with POSIX wildcard matching.
If both c and i are included in the parameters string, the one that occurs last in the string dictates whether the function performs case-sensitive or case-insensitive matching. For example, ci specifies case-insensitive matching because the “i” occurs last in the string.

ksh: remove last extension from a multiple extension filename

I have a filename in the format dir1/dir2/filename.txt.org and I would like to rename it to dir1/dir2/filename.txt. How can this be done? I tried 'cut' with '.' as the separator, but it also removes .txt.
You can try Korn shell variable expansion formats instead of using a subprocess (e.g. cut). This can be much faster.
example:
var1=dir1/dir2/filename.txt.org
var2=${var1%.*}
If you now print $var2 its value will be dir1/dir2/filename.txt
The % tells the shell to delete the shortest match of the pattern from the end of the value. Here the pattern .* is a shell glob, not a regex: a literal period followed by anything, so only the rightmost period and everything after it are removed.
${variable%pattern} - return the value of variable without the smallest ending portion that matches pattern.
Other variable expansion formats are available; it is worthwhile to study the docs.
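As a rough illustration of those other formats, the sketch below applies the four prefix/suffix removal operators to the same value. The variable names stem, file and ext are made up for this example. These expansions are standard in ksh, bash and other POSIX shells.

```sh
var1=dir1/dir2/filename.txt.org

var2=${var1%.*}    # shortest suffix matching ".*" removed
stem=${var1%%.*}   # longest suffix matching ".*" removed
file=${var1##*/}   # longest prefix matching "*/" removed
ext=${var1##*.}    # longest prefix matching "*." removed

printf '%s\n' "$var2" "$stem" "$file" "$ext"
# dir1/dir2/filename.txt
# dir1/dir2/filename
# filename.txt.org
# org
```

For the actual rename, something like mv "$var1" "${var1%.*}" should do, assuming the file exists at that path.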

Split array element delimited with '.'

I am trying to read below CSV file content line by line in Perl.
CSV File Content:
A7777777.A777777777.XXX3604,XXX,3604,YES,9
B9694396.B216905785.YYY0018,YYY,0018,YES,13
C9694396.C216905785.ZZZ0028,ZZZ,0028,YES,16
I am able to split line content using below code and able to verify the content too:
@column_fields1 = split(',', $_);
print $column_fields1[0],"\n";
I am also trying to find the second part on the first column of CSV file (i.e., A777777777 or B216905785 or C216905785) – the first column delimited with . using the below code and I am unable to get it.
Instead, just a new line printed.
my ($v1, $v2, $v3) = split(".", $column_fields1[0]);
print $v2,"\n";
Can someone suggest how to split the array element and get the above value?
In my program, I need the whole first-column value in one place and just the second part in another.
Below is my code:
use strict;
use warnings;
my $dailybillable_tab_section1_file = "./sql/demanding_01_T.csv";
open(FILE, $dailybillable_tab_section1_file) or die "Could not read from $dailybillable_tab_section1_file, program halting.";
my @column_fields1;
my @column_fields2;
while (<FILE>)
{
    chomp;
    @column_fields1 = split(',', $_);
    print $column_fields1[0],"\n";
    my ($v1, $v2, $v3) = split(".",$column_fields1[0]);
    print $v2,"\n";
    if($v2 ne 'A777777777')
    {
        …
        …
        …
    }
    else
    {
        …
        …
        …
    }
}
close FILE;
split takes a regex as its first argument. You can pass it a string (as in your code), but the contents of the string will simply be interpreted as a regex at runtime.
That's not a problem for , (which has no special meaning in a regex), but it breaks with . (which matches any (non-newline) character in a regex).
Your attempt to fix the problem with split "\." fails because "\." is identical to ".": The backslash has its normal string escape meaning, but since . isn't special in strings, escaping it has no effect. You can see this by just printing the resulting string:
print "\.\n"; # outputs '.', same as print ".\n";
That . is then interpreted as a regex, causing the problems you have observed.
The normal fix is to just pass a regex to split:
split /\./, $string
Now the backslash is interpreted as part of the regex, forcing . to match itself literally.
If you really wanted to pass a string to split (I'm not sure why you'd want to do that), you could also do it like this:
split "\\.", $string
The first backslash escapes the second backslash, giving a two character string (\.), which when interpreted as a regex means the same thing as /\./.
If you look at the documentation for split(), you'll see it gives the following ways to call the function:
split /PATTERN/,EXPR,LIMIT
split /PATTERN/,EXPR
split /PATTERN/
split
In three of those examples, the first argument to the function is /PATTERN/. That is, split() expects to be given a regular expression which defines how the input string is split apart.
It's very important to realise that this argument is a regex, not a string. Unfortunately, Perl's parser doesn't insist on that. It allows you to use a first argument which looks like a string (as you have done). But no matter how it looks, it's not a string. It's a regex.
So you have confused yourself by using code like this:
split(".",$COLUMN_FIELDS1[0])
If you had made the first argument look like a regex, then you would be more likely to realise that the first argument is a regex and that, therefore, a dot needs to be escaped to prevent it being interpreted as a metacharacter.
split(/\./, $COLUMN_FIELDS1[0])
Update: It's generally accepted among Perl programmers that variables with upper-case names are constants and don't change their values. By using upper-case names for standard variables, you are likely to confuse the next person who edits your code (who could well be you in six months' time).

How to change values of bash array elements without loop

array=(a b c d)
I would like to add a character before each element of the array in order to have this
array=(^a ^b ^c ^d)
An easy way to do that is to loop on array elements and change values one by one
for i in "${!array[@]}"
do
    array[i]="^${array[i]}"
done
But I would like to know if there is any way to do the same thing without looping on the array as I have to do the same instruction on all elements.
Thanks in advance.
Use Parameter Expansion:
array=("${array[@]/#/^}")
From the documentation:
${parameter/pattern/string}
Pattern substitution. The pattern is expanded to produce a pattern just as in pathname
expansion. Parameter is expanded and the longest match of pattern against its value is
replaced with string. If pattern begins with /, all matches of pattern are replaced with
string. Normally only the first match is replaced. If pattern begins with #, it must
match at the beginning of the expanded value of parameter. If pattern begins with %, it
must match at the end of the expanded value of parameter. If string is null, matches of
pattern are deleted and the / following pattern may be omitted. If parameter is @ or *,
the substitution operation is applied to each positional parameter in turn, and the
expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the
substitution operation is applied to each member of the array in turn, and the expansion is
the resultant list.
This approach also preserves whitespace in array values:
array=( "${array[@]/#/^}" )
Note: this will FAIL if the array is empty and you have previously run
set -u
I don't know how to eliminate this issue using short code...
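One possible way around the set -u problem is to guard the expansion with ${array[@]+...}, which expands to nothing when the array has no elements. The sketch below assumes bash; the array name empty is made up for the example. As far as I know, bash 4.4 and later no longer treat "${array[@]}" on an empty array as an error under set -u, so the guard mainly matters on older versions.

```bash
set -u

array=(a "b c" d)
array=("${array[@]/#/^}")        # prefix every element with ^
printf '<%s>\n' "${array[@]}"    # prints <^a>, <^b c>, <^d> on separate lines

empty=()
# With no elements, ${empty[@]+...} yields nothing, so the inner
# substitution is never evaluated and set -u has nothing to complain about.
empty=(${empty[@]+"${empty[@]/#/^}"})
echo "${#empty[@]}"              # prints 0
```

The guarded form can be used for non-empty arrays as well, since the quoted inner expansion still produces one word per element.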

Flex Regular Expression to Identify AWK Regular Expression

I am putting together the last pattern for my flex scanner for parsing AWK source code.
I cannot figure out how to match the regular expressions used in the AWK source code as seen below:
{if ($0 ~ /^\/\// ){ #Match for "//" (Comment)
or more simply:
else if ($0 ~ /^Department/){
where the AWK regular expression is encapsulated within "/ /".
All of the Flex patterns I have tried so far match my entire input file. I have tried changing the precedence of the regex pattern and have found no luck. Help would be greatly appreciated!!
regexing regexen must be a meme somewhere. Anyway, let's give it a try.
A gawk regex consists of:
/
any number of regex components
/
A regex component (simplified form -- Note 1) is one of the following:
any character other than /, [ or \
a \ followed by any single character (we won't get into linefeeds just now, though).
a character class (see below)
Up to here it's easy. Now for the fun part.
A character class is:
[ or [^ or [] or [^] (Note 2)
any number of character class components
]
A character class component is (theoretically, but see below for the gawk bug) one of the following:
any single character other than ] or \ (Note 3)
a \ followed by any single character
a named character class
a collation class
A named character class is: (Note 5)
[:
a valid class name, which afaik is always a sequence of alpha characters, but it's maybe safer not to make assumptions.
:]
A collation class is mostly unimplemented but partially parsed. You could probably ignore them, because it seems like gawk doesn't get them right yet (Note 4). But for what it's worth:
[.
some multicharacter collation character, like 'ij' in Dutch locale (I think).
.]
or an equivalence class:
[=
some character, or maybe also a multicharacter collation character
=]
An important point is that [/] does not terminate the regex. You don't need to write [\/]. (You don't need to do anything to implement that. I'm just mentioning it.)
Note 1:
Actually, the interpretation of \ and character classes, when we get to them, is a lot more complicated. I'm just describing enough of it for lexing. If you actually want to parse the regexen into their bits and pieces, it's a lot more irritating.
For example, you can specify an arbitrary octet with \ddd or \xHH (eg \203 or \x4F). However, we don't need to care, because nothing in the escape sequence is special, so for lexing purposes it doesn't matter; we'll get the right end of the lexeme. Similarly, I didn't bother describing character ranges and the peculiar rules for - inside a character class, nor do I worry about regex metacharacters (){}?*+. at all, since they don't enter into lexing. You do have to worry about [] because it can implicitly hide a / from terminating the regex. (I once wrote a regex parser which let you hide / inside parenthesized expressions, which I thought was cool -- it cuts down a lot on the kilroy-was-here noise (\/) -- but nobody else seems to think this is a good idea.)
Note 2:
Although gawk does \ wrong inside character classes (see Note 3 below), it doesn't require that you use them, so you can still use Posix behaviour. Posix behaviour is that the ] does not terminate the character class if it is the first character in the character class, possibly following the negating ^. The easiest way to deal with this is to let character classes start with any of the four possible sequences, which is summarized as:
\[^?]?
Note 3:
gawk differs from Posix ERE's (Extended Regular Expressions) in that it interprets \ inside a character class as an escape character. Posix mandates that \ loses its special meaning inside character classes. I find it annoying that gawk does this (and so do many other regex libraries, equally annoying.) It's particularly annoying that the gawk info manual says that Posix requires it to do this, when it actually requires the reverse. But that's just me. Anyway, in gawk:
/[\]/]/
is a regular expression which matches either ] or /. In Posix, stripping the enclosing /s out of the way, it would be a regular expression which matches a \ followed by a / followed by a ]. (Both gawk and Posix require that ] not be special when it's not being treated as a character class terminator.)
Note 4:
There's a bug in the version of gawk installed on my machine where the regex parser gets confused at the end of a collating class. So it thinks the regex is terminated by the second / in:
/[[.a.]/]/
although it gets this right:
/[[:alpha:]/]/
and, of course, putting the slash first always works:
/[/[:alpha:]]/
Note 5:
Character classes and collating classes and friends are a bit tricky to parse because they have two-character terminators. "Write a regex to recognize C /* */ comments" used to be a standard interview question, but I suppose it no longer is. Anyway, here's a solution (for [:...:], but just substitute the other punctuation for : if you want to):
[[]:([^:]|:*[^]:])*:+[]] // Yes, I know it's unreadable. Stare at it a while.
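To convince yourself that this pattern stops at the first :] and not a later one, it can be tried outside flex. The sketch below feeds it to grep -oE, on the assumption that grep's extended regex syntax treats these bracket expressions the same way flex does and that your grep supports -o (GNU and BSD grep do). The variable name pattern is made up for the example.

```sh
# The Note 5 pattern, without the trailing // comment.
pattern='[[]:([^:]|:*[^]:])*:+[]]'

# Inside an awk regex literal: only the [:alpha:] part should be picked out.
printf '%s\n' '/[[:alpha:]/]/' | grep -oE "$pattern"
# prints: [:alpha:]

# Two classes on one line are reported separately, not as one long match.
printf '%s\n' '[:a:] [:b:]' | grep -oE "$pattern"
# prints: [:a:]
#         [:b:]
```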
A regex can also work without the "/.../" delimiters, written as a string. See the example:
print the numbers from 1-100 that contain a 7 followed by another digit:
kent$ seq 100|awk '{if($0~"7[0-9]")print}'
70
71
72
73
74
75
76
77
78
79
kent$ awk --version
GNU Awk 3.1.6
