Unable to form the required regex in C - c

I am trying to write a regex which can search a string and return true if it matches with the regex and false otherwise.
Check should ensure string is wildcard domain name of a website.
Example:
*.cool.dude is valid
*.cool is not valid
abc.cool.dude is not valid
So I had written something which like this
\\*\\.[.*]\\.[.*]
However, this is also allowing a *.. string as valid string because * means 0 or infinite occurrences.
I am looking for something which ensures that at-least 1 occurrence of the string happens.
Example:
*.a.b -> valid but *.. -> invalid
how to change the regex to support this?
I have already tried doing something like this:
\\*\\.([.*]{1,})\\.([.*]{1,}) -> doesnt work
\\*\\.([.+])\\.(.+) -> doesnt work
^\\*\\.[a-zA-Z]+\\.[a-zA-Z]+ -> doesnt work
I have tried a bunch of other options as well and have failed to find a solution. Would be great if someone can provide some input.
PS. Looking for a solution which works in C.

[.*] does not mean "0 or more occurrences" of anything. It means "a single character, either a (literal) . or a (literal) [*]". […] defines a character class, which matches exactly one character from the specified set. Brackets are not even remotely the same as parentheses.
So if you wanted to express "zero or more of any character except newline", you could just write .*. That's what .* means. And if you wanted "one or more" instead of "zero or more", you could change the * to a plus, as long as you remember that regex.h regexes should always be compiled with the REG_EXTENDED flag. Without that flag, + is just an ordinary character. (And there are a lot of other inconveniences.)
But that's probably not really what you want. My guess is that you want something like:
^[*]([.][A-Za-z0-9_]+){2,}$
although you'll have to correct the character class to specify the precise set of characters you think are legitimate.
Again, don't forget the crucial REG_EXTENDED flag when you call regcomp.
Some notes:
The {2,} requires at least two components after the *, so that *.cool doesn't match.
The ^ and $ at the beginning and end of the regex "anchor" the match to the entire input. That stops the pattern matching just a part of the input, but it might not be exactly what you want, either.
Finally, I deliberately used a single-character character class to force [*] and [.] to be ordinary characters. I find that a lot more readable than falling timber (\\) and it avoids having to think about the combination of string escaping and regex-escaping.
For more information, I highly recommend reading man regcomp and man 7 regex. A good introduction to regexes might be useful, as well.

Related

Extracting the domain extension of a URL stored in a string using scanf()

I am writing a code that takes a URL address as a string literal as input, then runs the domain extension of the URL through an array and returns the index if finds a match, -1 if does not.
For example, an input would be www.stackoverflow.com, in this case, I'd need to extract only the com part. In case of www.google.com.tr, I'd need only com again, ignoring the .tr part.
I can think of basically writing a function that'll do that just fine but I'm wondering if it is possible to do it using scanf() itself?
It's really an overhead to use scanf here. But you can do this to realize something similar
char a[MAXLEN],b[MAXLEN],c[MAXLEN];
scanf("%[^.].%[^.].%[^. \n]",a,b,c);
printf("Desired part is = %s\n",c);
To be sure that formatting is correct you can check whether this scanf call is successful or not. For example:
if( 3 != scanf("%[^.].%[^.].%[^. \n]",a,b,c)){
fprintf(stderr,"Format must be atleast sth.something.sth\n");
exit(EXIT_FAILURE);
}
What is the other way of achieving this same thing. Use fgets to read the whole line and then parse with strtok with delimiters ".". This way you will get parts of it. With fgets you can easily support different kind of rules. Instead of incorporating it in scanf (which will be a bit difficult in error case), you can use fgets,strtok to do the same.
With the solution provided above only the first three parts of the url is being considered. Rest are not parsed. But this is hardly the practical situation. Most the time we have to process the whole information, all the parts of the url (and we don't know how many parts can be there). Then you would be better using fgets/strtok as mentioned above.

Why does this scanf() conversion actually work?

Ah, the age old tale of a programmer incrementally writing some code that they aren't expecting to do anything more than expected, but the code unexpectedly does everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up reading either just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "[a-zA-Z -[\nn]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!
You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. . and * are taken literally. They don't have a special meaning like in a regular expression.
If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.

How to wildcard search with capture in C?

I'm trying to write a routine in C to capture sequences of characters in a string argument. The matching criteria in addition to characters can have ? meaning exactly one character and * meaning zero or more characters. (lazy).
e.g.
string: ok1ok1234567890
match: *(ok?2*)4*
The result should be the position of the match = 3 and the length of the match = 5
I have tried numerous ways of doing this, have put it aside, come back to it, put it aside again etc. I cannot crack it. It needs to be a purely C solution and able to capture multiple captures.
e.g. (*)(ok??)3(4*)8*
Every solution I come up with works in many cases but not all. I'm hoping someone somewhere might have done this already or have an insight to how it can be done.

Regular expressions in flex has error

I am new in flex and I want to design a scanner using flex.
At this step, I want to make regular expression to match with id, but here are some conditions:
underline can exist in id
you can use _ whenever you want, but if you are using them exactly
consequently it can be at most 2 underlines for example :
a_b_c »»»» true
a___b »»»» false
123abv »»»» false
integers can't be at the beginning of an id
underline can't exist at the end of an id
The regular expression I have written for that is :
(\b(_{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*)\b)
but now I have 2 questions:
Is the regular expression true? I have tested it in rubular.com and I think this is true but I'm not sure?
The other important problem is that when I write this in my flex file, Unfortunately no id is identified. But I can't why it is not recognized
Can anyone please help me?
The problem here is your ID regular expression. You are using \b to match a word boundary, but Flex's regular expressions have no built-in support for matching word boundaries. Other than that, your regular expression is sound. I was able to get your code working using this modified version of yours: _{0,2}[A-Za-z][0-9A-Za-z]*(_{0,2}[0-9A-Za-z]+)*. (I just got rid of the \b's, and some of the parentheses that bothered me).
Unfortunately, this causes a slight problem. Say that you're lexing and run across something like 12_345. Flex will read 12, assume that it found an IC, and then read _. Finding no match, it will print that to stdout, then read 345 as another IC.
In order to avoid this issue (caused by Flex's lack of word boundaries), you could do one of two things:
Create a rule at the end that matches any character (other than whitespace) and make it give an error. This would stop Flex when it got to _ in the example above.
Create a rule at the end that matches any combination of letters, numbers, and underscores ([_0-9A-Za-z]+). If it is matched, give an error. This will cause Flex to return the entire token 12_345 as an error in the above example.
One other problem: The ID regular expression still won't match anything with underscores at the end of it. This means your current regular expression isn't perfect, and you'll need to do some tweaking with it, but now you know not to use the \b symbol. Here is a reference on Flex's regular expression syntax so you can find other things to use/avoid.
I think your requirement is:
Identifiers can use only alphanumeric characters and _
Identifiers cannot start with a number
Identifiers cannot end with an _
Identifiers cannot include more than two consecutive _
(When I first read your question, I thought the last requirement was that identifiers cannot include more than two _, but looking at the proposed regex, I think the version above is more accurate.)
Based on the above, you should be able to use the following two Flex patterns:
([[:alpha:]]|__?[[:alnum:]])(_?_?[[:alnum:]])* { /* Handle an identifier */ }
[[:alpha:]_][[:alnum:]_]* { /* Error */ }
Breaking that down:
([[:alpha:]]|__?[[:alnum:]]) matches an alphabetic character or one or two _ followed by an alphanumeric character.
(_?_?[[:alnum:]])* matches a string of and alphanumeric characters, with a maximum of two before an alphanumeric character.
The second pattern will match anything which starts with an alphabetic character or followed by any number of alphanumerics or . This will match all valid identifiers as well as the sequences which contain too many consecutive or which end with . If both patterns match (that is, a valid identifier), the first one will win, so it will be correctly recognized. The second pattern will consume the entire erroneous identifier, allowing for easier error recovery.
The pattern in the OP doesn't work because flex treats \b as a backspace character (as in C). Flex does not implement word boundary assertions, but in a lexer you almost never need these; the pattern above can be used if necessary.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick brown! ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by up 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.

Resources