RegEx for matching two letters with special boundaries - c

I want to make sure user input has:
Two letters at the start
And the support for any number of optional space characters following these two letters.
Additionally, if at least one space character is provided, optionally allow letters, digits or . characters after it.
Here's the expression I currently have:
[a-zA-Z][a-zA-Z] (?\\s+ (?a-zA-Z0-9.))
And here's my thinking:
[a-zA-Z][a-zA-Z] makes sure the input begins with at least two letters
(?\\s+ begins an optional statement. This optional statement must start with at least one space (I'm on windows which is why I have two slashes).
(?a-zA-Z0-9.)) finishes the optional statement. So, if at least one space is provided, at least one optional character, number or . can also be added.
For instance, ab, ab , ab .s, and ab .asd2 should all be valid inputs.
How do I solve this problem?

The problem with your attempt is that both (?\ and (?a are syntax errors. If you want to create an optional group, you need to write (...)?, not (?...).
(The other issue is that a-zA-Z0-9 in your regex matches literally because it's not part of a character class.)
Besides, \s (to match whitespace) does not exist in POSIX regex.
My suggestion:
^[a-zA-Z]{2}( +[a-zA-Z0-9.]*)?$
That is:
^ # beginning of string
[a-zA-Z]{2} # exactly two letters
(
\ + # one or more spaces
[a-zA-Z0-9.]* # zero or more of: letters, digits, or dot
)? # ... this group is optional
$ # end of string
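Since the question is tagged C, here is a minimal sketch of how the suggested pattern could be checked with POSIX regex.h. The function name is_valid_input is made up for illustration, and returning false when compilation fails is an assumption about how errors should be handled.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if `input` matches
 * ^[a-zA-Z]{2}( +[a-zA-Z0-9.]*)?$ */
bool is_valid_input(const char *input)
{
    regex_t re;
    bool ok;

    /* REG_EXTENDED is needed for {2}, + and (...)? without backslashes.
     * REG_NOSUB: we only want a yes/no answer, not match offsets. */
    if (regcomp(&re, "^[a-zA-Z]{2}( +[a-zA-Z0-9.]*)?$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return false;   /* assumption: treat a compile failure as "invalid" */

    ok = regexec(&re, input, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

With this, is_valid_input("ab .asd2") should return true, while is_valid_input("ab.s") should return false because no space follows the two letters.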

Related

ksh: remove last extension from a multiple extension filename

I have a filename in the format dir1/dir2/filename.txt.org and I would like to rename it to dir1/dir2/filename.txt. How can this be done? I tried cut with '.' as the separator, but it also removes .txt.
You can try Korn shell variable expansion formats instead of using a subprocess (e.g. cut). This can be much faster.
example:
var1=dir1/dir2/filename.txt.org
var2=${var1%.*}
If you now print $var2 its value will be dir1/dir2/filename.txt
The % tells the shell to delete the shortest trailing portion that matches .* (a period followed by anything, so everything from the rightmost period onward).
${variable%pattern} - return the value of variable without the smallest ending portion that matches pattern.
Other variable expansion formats are available; it is worthwhile to study the docs.

Replacing the whitespace in a string with a given string in C

I want to replace each whitespace character with the string "IIT". I tried looping over my string, and when I encountered whitespace I tried to replace it with the given string. But a whitespace is a single character in the string, so it cannot simply be overwritten with a whole word. How can I replace the whitespace with the given word? Thank you.
The trick to replacing a single character in a string with multiple characters without using a second string is to process the string from end to beginning.
First, go through the string once, counting how many characters there are to be replaced. Then compute how many extra characters your replacement will add. Make sure the string has enough space allocated to handle the new characters. Then starting with the last character in the string, move each character to the new end of the string, replacing specific characters with your replacement characters.
Example, replace x with zz
xcfdxdfxg---
(dashes are space allocated for the string, but not currently used, and of course there should be a \0 at the end of the string, which also properly gets moved)
xcfdxdfxg---
xcfdxdfx---g
xcfdxdf--zzg
xcfdxd--fzzg
xcfdx--dfzzg
xcfd-zzdfzzg
xcf-dzzdfzzg
xc-fdzzdfzzg
x-cfdzzdfzzg
zzcfdzzdfzzg
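The back-to-front steps above can be sketched in C as follows. The function name replace_char_in_place and the cap parameter (total bytes allocated for the buffer) are assumptions made for illustration; the replacement text is assumed to be non-empty and to live outside the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replace every `target` in `buf` with `repl`,
 * in place, copying from the end of the string toward the beginning.
 * `cap` is the total number of bytes allocated for `buf`.
 * Returns 0 on success, -1 if `repl` is empty or the buffer is too small. */
int replace_char_in_place(char *buf, size_t cap, char target, const char *repl)
{
    size_t len = strlen(buf);
    size_t rlen = strlen(repl);
    size_t count = 0;

    if (rlen == 0)
        return -1;

    /* Pass 1: count the characters that will be replaced. */
    for (size_t i = 0; i < len; i++)
        if (buf[i] == target)
            count++;

    /* Each replacement adds (rlen - 1) characters. */
    size_t new_len = len + count * (rlen - 1);
    if (new_len + 1 > cap)
        return -1;

    /* Pass 2: move characters to their final position, last one first,
     * so nothing is overwritten before it has been read. */
    size_t src = len;
    size_t dst = new_len;
    buf[dst] = '\0';
    while (src > 0) {
        src--;
        if (buf[src] == target) {
            dst -= rlen;
            memcpy(buf + dst, repl, rlen);
        } else {
            buf[--dst] = buf[src];
        }
    }
    return 0;
}
```

For the question's case, a buffer declared as char s[32] = "a b c" and a call replace_char_in_place(s, sizeof s, ' ', "IIT") should leave "aIITbIITc" in s.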
The C standard library's string functions are not powerful enough to replace substrings easily. Instead, you can use a lexical analyzer generator such as Flex, which gives you regex power to find and manipulate text.
Here is a program which compresses multiple blanks and tabs down to a single blank, and throws away whitespace found at the end of a line:
%%
[ \t]+ putchar( ' ' );
[ \t]+$ /* ignore this token */
Flex will generate a C program for you, which does all the work.
Tutorial: http://alumni.cs.ucr.edu/~lgao/teaching/flex.html
You cannot do this in the string as it stands: "IIT" is 3 bytes and a whitespace character is a single byte, so there is nowhere to store the extra bytes. You can do it by allocating more memory before placing the string "IIT". See realloc for more information on this.
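As a rough sketch of the allocation idea, the function below builds the result in a freshly allocated buffer. Note that it uses malloc for a new string rather than realloc on the original, which is a swapped-in simplification; the name replace_spaces_alloc is hypothetical, and the caller is assumed to free the result.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `src` in which
 * every space is replaced by `repl`. Returns NULL if allocation fails.
 * The caller owns the result and must free() it. */
char *replace_spaces_alloc(const char *src, const char *repl)
{
    size_t rlen = strlen(repl);
    size_t len = 0;
    size_t spaces = 0;

    /* Measure the input and count the spaces in one pass. */
    for (; src[len] != '\0'; len++)
        if (src[len] == ' ')
            spaces++;

    /* Non-space characters + replacements + terminating '\0'. */
    char *out = malloc(len - spaces + spaces * rlen + 1);
    if (out == NULL)
        return NULL;

    char *p = out;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == ' ') {
            memcpy(p, repl, rlen);
            p += rlen;
        } else {
            *p++ = src[i];
        }
    }
    *p = '\0';
    return out;
}
```

A fresh buffer avoids any overlap concerns at the cost of one extra allocation.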

C How do i specify a POSIX regex that begins in a blank line and ends in a blank line?

I am trying to write code to scan a file and produce a "match!" message when the tool reads a certain line of code preceded and followed by blank lines. The line I am interested in matching is:
Appliance Version 3.1.2
Using regex.h, I have a simple tool that compiles my regex pattern then executes it against every line in the file to search for a match. The basic functionality of the tool is fine: I am able to get it to successfully search for various regex matches. Trouble arises when I try to match a regex containing a blank line before and after the above line of text. Here is my precompiled regex:
[[:space:]]+\n^Appliance Version [[:alnum:]]$\n
I have tried a series of different combinations similar to this, and nothing seems to work. I think it might have to do with \n in which case I would need to figure out a new way to specify the two blank lines. Any insight of POSIX regex would be greatly appreciated!
Looking at your regex, it looks like it is trying to match
Appliance Version [[:alnum:]]
at the end of a line ($). That would be matched by
Appliance Version 3
(3 is an instance of [:alnum:]), but not by
Appliance Version 33
([[:alnum:]] only matches one character), and much less by
Appliance Version 3.1.2
(the above problem, and also . is not an instance of [:alnum:])
So at a minimum you need to change [[:alnum:]] to [.[:alnum:]]* (or some such).
In addition, your use of ^ and $ is redundant with the explicit \n, but nothing in the regex requires the match to be preceded or followed by a blank line. For example, [[:space:]]\n would happily be matched with the line:
Not a blank line, but with a blank at the end: \n
(where I've written the \n explicitly to show the blank character at the end of the line.)
Matching blank lines
A single blank line is matched with ^[[:space:]]*$. That does not match the newlines at either end. If you want to match a blank line before something, use: ^[[:space:]]*\nSOMETHING. To match a blank line after something: SOMETHING\n[[:space:]]*$. Or, if you really want a blank line before and after: ^[[:space:]]*\nSOMETHING\n[[:space:]]*$. (But that won't match if SOMETHING happens to be the first line of the input, for example. Or the last line.)
As @rici notes, you cannot combine \n^ to match two blank lines -- the markers ^ and $ match a position, not a literal \n character.
To match a blank line, use \n\n, or -- better because you probably don't want to do anything with the hard return that ends the line above, (?<=\n)\n at the start. You can leave the \n\n at the end, though. Note that the lookbehind (?<=...) is Perl/PCRE syntax; POSIX regex.h does not support it, so with regcomp only the plain \n\n form applies.
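A pattern that spans several lines can only match if the regex sees several lines at once, so the sketch below assumes the whole file has already been read into one NUL-terminated buffer instead of being tested line by line. It uses the ^[[:space:]]*\nSOMETHING\n[[:space:]]*$ form with REG_NEWLINE so that ^ and $ also match at line boundaries inside the buffer. The function name has_isolated_version_line is made up for illustration.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `text` (the whole file in one buffer)
 * contains an "Appliance Version x.y.z" line that has a blank line
 * directly before and directly after it. */
bool has_isolated_version_line(const char *text)
{
    /* "\n" in the C string is a literal newline character in the pattern. */
    static const char pattern[] =
        "^[[:space:]]*\nAppliance Version [[:alnum:].]+\n[[:space:]]*$";
    regex_t re;
    bool found;

    /* REG_NEWLINE lets ^ and $ match at the start and end of each line. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NEWLINE | REG_NOSUB) != 0)
        return false;

    found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

As noted above, this form does not match when the version line is the first or last line of the input.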

Lexical analyzer output issue

When I give my lexical analyzer the following input:
/*This is an example */
program
var a,b:integer;
begin
a =2;
b =a+5;
write(a);
if b==1 then write(a);
end
the output should look like this:
<res,program>
<res,var> <id,a>,<id,b>:<res,integer>;
<res,begin>
<id,a> <assign,=><num,2>;
<id,b> <assign,=><id,a><addop,+><num,5>;
<res,write>(<id,a>);
<res,if> <id,b><relop,==><num,1> <res,then> <res,write>(<id,a>);
<res,end>
but my output is:
Lexical Error~/hedor1>exampler < input\ .txt
<res,program><res,var><id,a>,<id,b>:<res,integer>;<res,begin><id,a><assign,=><num,2>;<id,b><assign,=><id,a><addop,+><num,5>;<res,write>(<id,a>);<res,if><id,b><relop,==><num,1><res,then><res,write>(<id,a>);<res,end>
I don't know why it skips the newline and does not print it to the output, although I have defined the rule \n printf("\n"); in my patterns section.
What is the problem?
Nowhere in your input do you have a single newline by itself. All you have are sequences of one or more whitespace characters (spaces, tabs and newlines). Since you have a rule that matches that, Flex uses the longest match.
Flex generates a greedy parser, which tries to match as much of the input as possible. For example, if it sees the input reality, it doesn't stop after matching real and then go on and match ity as a separate token. Instead, it matches all of reality.
In the same way, in your input after the starting comment you have not one but two newlines (since there is an empty line there), and this will be matched by your {whitespace}+ rule, instead of twice by the \n rule.
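To make the longest-match rule concrete, here is a small hand-written C sketch that mimics how a scanner chooses between the two rules. It is not what Flex generates; the names are hypothetical, the whitespace rule is assumed to be [ \t\n]+, and the \n rule is assumed to be listed first so that it wins ties.

```c
#include <stddef.h>

enum rule { RULE_NONE, RULE_NEWLINE, RULE_WHITESPACE };

/* Length matched by a rule like [ \t\n]+ at the start of s (0 if none). */
static size_t match_whitespace(const char *s)
{
    size_t n = 0;
    while (s[n] == ' ' || s[n] == '\t' || s[n] == '\n')
        n++;
    return n;
}

/* Length matched by the rule \n at the start of s (0 or 1). */
static size_t match_newline(const char *s)
{
    return s[0] == '\n' ? 1 : 0;
}

/* Pick the rule with the longest match; on a tie the earlier rule
 * (assumed here to be the \n rule) wins. */
enum rule pick_rule(const char *s, size_t *len)
{
    size_t nl = match_newline(s);
    size_t ws = match_whitespace(s);

    if (nl == 0 && ws == 0) {
        *len = 0;
        return RULE_NONE;
    }
    if (ws > nl) {
        *len = ws;
        return RULE_WHITESPACE;
    }
    *len = nl;
    return RULE_NEWLINE;
}
```

Under these assumptions, a lone "\n" goes to the newline rule, but "\n\n" or "\n  " is longer under the whitespace rule, so the newline action never runs for it.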

Flex Regular Expression to Identify AWK Regular Expression

I am putting together the last pattern for my flex scanner for parsing AWK source code.
I cannot figure out how to match the regular expressions used in the AWK source code as seen below:
{if ($0 ~ /^\/\// ){ #Match for "//" (Comment)
or more simply:
else if ($0 ~ /^Department/){
where the AWK regular expression is encapsulated within "/ /".
All of the Flex patterns I have tried so far match my entire input file. I have tried changing the precedence of the regex pattern and have found no luck. Help would be greatly appreciated!!
regexing regexen must be a meme somewhere. Anyway, let's give it a try.
A gawk regex consists of:
/
any number of regex components
/
A regex component (simplified form -- Note 1) is one of the following:
any character other than /, [ or \
a \ followed by any single character (we won't get into linefeeds just now, though)
a character class (see below)
Up to here it's easy. Now for the fun part.
A character class is:
[ or [^ or [] or [^] (Note 2)
any number of character class components
]
A character class component is (theoretically, but see below for the gawk bug) one of the following:
any single character other than ] or \ (Note 3)
a \ followed by any single character
a named character class
a collation class
A named character class is: (Note 5)
[:
a valid class name, which afaik is always a sequence of alpha characters, but it's maybe safer not to make assumptions.
:]
A collation class is mostly unimplemented but partially parsed. You could probably ignore them, because it seems like gawk doesn't get them right yet (Note 4). But for what it's worth:
[.
some multicharacter collation character, like 'ij' in Dutch locale (I think).
.]
or an equivalence class:
[=
some character, or maybe also a multicharacter collation character
=]
An important point is that [/] does not terminate the regex. You don't need to write [\/]. (You don't need to do anything to implement that. I'm just mentioning it.)
Note 1:
Actually, the interpretation of \ and character classes, when we get to them, is a lot more complicated. I'm just describing enough of it for lexing. If you actually want to parse the regexen into their bits and pieces, it's a lot more irritating.
For example, you can specify an arbitrary octet with \ddd or \xHH (eg \203 or \x4F). However, we don't need to care, because nothing in the escape sequence is special, so for lexing purposes it doesn't matter; we'll get the right end of the lexeme. Similarly, I didn't bother describing character ranges and the peculiar rules for - inside a character class, nor do I worry about regex metacharacters (){}?*+. at all, since they don't enter into lexing. You do have to worry about [] because it can implicitly hide a / from terminating the regex. (I once wrote a regex parser which let you hide / inside parenthesized expressions, which I thought was cool -- it cuts down a lot on the kilroy-was-here noise (\/) -- but nobody else seems to think this is a good idea.)
Note 2:
Although gawk does \ wrong inside character classes (see Note 3 below), it doesn't require that you use them, so you can still use Posix behaviour. Posix behaviour is that the ] does not terminate the character class if it is the first character in the character class, possibly following the negating ^. The easiest way to deal with this is to let character classes start with any of the four possible sequences, which is summarized as:
\[^?]?
Note 3:
gawk differs from Posix ERE's (Extended Regular Expressions) in that it interprets \ inside a character class as an escape character. Posix mandates that \ loses its special meaning inside character classes. I find it annoying that gawk does this (and so do many other regex libraries, equally annoying.) It's particularly annoying that the gawk info manual says that Posix requires it to do this, when it actually requires the reverse. But that's just me. Anyway, in gawk:
/[\]/]/
is a regular expression which matches either ] or /. In Posix, stripping the enclosing /s out of the way, it would be a regular expression which matches a \ followed by a / followed by a ]. (Both gawk and Posix require that ] not be special when it's not being treated as a character class terminator.)
Note 4:
There's a bug in the version of gawk installed on my machine where the regex parser gets confused at the end of a collating class. So it thinks the regex is terminated by the second / in:
/[[.a.]/]/
although it gets this right:
/[[:alpha:]/]/
and, of course, putting the slash first always works:
/[/[:alpha:]]/
Note 5:
Character classes and collating classes and friends are a bit tricky to parse because they have two-character terminators. "Write a regex to recognize C /* */ comments" used to be a standard interview question, but I suppose it no longer is. Anyway, here's a solution (for [:...:], but just substitute the other punctuation for : if you want to):
[[]:([^:]|:*[^]:])*:+[]] // Yes, I know it's unreadable. Stare at it a while.
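For lexing purposes, the rules above could also be turned into a small hand-written scanner. The following C sketch is one possible reading of them, not gawk's actual lexer. The names awk_regex_length and skip_bracket are made up for illustration, it follows gawk's treatment of \ inside brackets (Note 3), and it assumes that an unescaped newline or end of input before the closing / means "not a regex".

```c
#include <stddef.h>

/* s points at '['. Returns a pointer just past the closing ']',
 * or NULL if the bracket expression is not terminated. */
static const char *skip_bracket(const char *s)
{
    s++;                         /* the opening '[' */
    if (*s == '^')
        s++;                     /* optional negation */
    if (*s == ']')
        s++;                     /* a leading ']' is an ordinary character */

    while (*s != '\0' && *s != '\n') {
        if (*s == ']')
            return s + 1;
        if (*s == '\\') {        /* gawk: \ escapes the next character */
            if (s[1] == '\0')
                return NULL;
            s += 2;
        } else if (*s == '[' && (s[1] == ':' || s[1] == '.' || s[1] == '=')) {
            char delim = s[1];   /* [: :], [. .] or [= =] */
            s += 2;
            while (*s != '\0' && !(s[0] == delim && s[1] == ']'))
                s++;
            if (*s == '\0')
                return NULL;
            s += 2;              /* skip the two-character terminator */
        } else {
            s++;
        }
    }
    return NULL;
}

/* Returns the length of the regex literal starting at s, including both
 * slashes, or 0 if s does not start a complete /.../ literal. */
size_t awk_regex_length(const char *s)
{
    const char *p = s;

    if (*p != '/')
        return 0;
    p++;
    while (*p != '\0' && *p != '\n') {
        if (*p == '/')
            return (size_t)(p - s) + 1;
        if (*p == '\\') {
            if (p[1] == '\0')
                return 0;
            p += 2;
        } else if (*p == '[') {
            p = skip_bracket(p);
            if (p == NULL)
                return 0;
        } else {
            p++;
        }
    }
    return 0;
}
```

Applied to the text starting at the first / of the question's /^\/\// example, this should report a length of 7. Deciding whether a / starts a regex or is a division operator is a separate problem that this sketch does not address.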
A regex can also work without the /.../ delimiters. For example, to print the two-digit numbers starting with 7 from 1-100:
kent$ seq 100|awk '{if($0~"7[0-9]")print}'
70
71
72
73
74
75
76
77
78
79
kent$ awk --version
GNU Awk 3.1.6
