Write a regexp string matching function that supports '.', '*' and '.*' [closed] - c

Closed 11 years ago.
The title is pretty explicit, and below are a couple of sample inputs and outputs. Note that the regexps used are supposed to match from the beginning to the end of the string.
'abc' =~ 'abc' (match)
'abc' =~ 'a*bc' (match)
'aaaaaaabc' =~ 'c*bc' (no match)
'aaaaaaabc' =~ 'a.*bc' (match)
'abbbbaaaaaabc' =~ 'ab*a*b*c' (match)
'abbbbaaaaaabc' =~ 'ab*a*h*bc' (match)
'bbd' =~ 'b*bbd' (match)
'bbd' =~ '.*bbd' (match)
'bbd' =~ '.*cbd' (no match)
'' =~ '.*' (match)
My implementation for this is located at:
https://github.com/jpbillaud/piexposed/blob/master/string/string_match_regexp.c
Now I was wondering if anybody could think of a more interesting way to solve this using DP, finite automata or whatever else.

Take a look at this implementation of a regular expression matcher by Rob Pike, taken from the book The Practice of Programming. It's absolutely beautiful code: in just 35 lines of C it meets all the requirements in the question (and a bit more!). Note that match searches for the pattern anywhere in the text, so to match from beginning to end as the question requires, anchor the pattern with ^ and $. Quoting from the article referenced above:
/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
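Since the question also asks about a DP approach, here is a minimal sketch of one, for comparison with the backtracking matcher above. It is not taken from the book or from the question's repository; the name full_match and the table layout are assumptions made for illustration. It matches the whole string (no ^ or $ needed) and supports only literals, '.', and a single-character operand followed by '*'. A pattern that starts with '*' is not rejected here; the star is simply treated as a literal.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* full_match (hypothetical name): true if pattern matches ALL of text.
 * Supports literal characters, '.', and 'x*' where x is a literal or '.'. */
bool full_match(const char *pattern, const char *text)
{
    size_t m = strlen(pattern), n = strlen(text);

    /* dp[i][j] is true when pattern[i..] matches text[j..]; stored row-major. */
    bool *dp = calloc((m + 1) * (n + 1), sizeof *dp);
    if (dp == NULL)
        return false;   /* out of memory: report "no match" in this sketch */
#define DP(i, j) dp[(i) * (n + 1) + (j)]

    DP(m, n) = true;    /* empty pattern matches empty text */

    /* Fill the table from the end of the pattern and text backwards. */
    for (size_t i = m; i-- > 0; ) {
        for (size_t j = n + 1; j-- > 0; ) {
            /* Does the single pattern character at i match text[j]? */
            bool first = j < n && (pattern[i] == '.' || pattern[i] == text[j]);

            if (i + 1 < m && pattern[i + 1] == '*')
                /* x*: either skip it entirely, or consume one char and stay on x* */
                DP(i, j) = DP(i + 2, j) || (first && DP(i, j + 1));
            else
                DP(i, j) = first && DP(i + 1, j + 1);
        }
    }

    bool result = DP(0, 0);
#undef DP
    free(dp);
    return result;
}
```

The table has (m+1) x (n+1) entries and each is filled in constant time, so the running time is proportional to the product of the two lengths, whereas the recursive matcher can backtrack heavily on patterns with several stars.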

I've never tried to write a regex matcher before, so I figured I'd give it a shot. I elided some of the boring stuff. Here's my (completely untested and uncompiled) version:
class Regex {
public:
Regex(const string& pattern) {
// Sanity check pattern:
if ((!pattern.empty() && pattern[0] == '*') ||
adjacent_find(pattern.begin(), pattern.end(), both_are_repeats) != pattern.end()) {
// throw exception
}
for (string::const_iterator curr(pattern.begin()), end(pattern.end()); curr != end; ) {
char current_match = *curr;
++curr;
// Fold any number of the following characters that are current_match or '*' into
// a single Node.
int stars = 0, count = 1;
for (; curr != end; ++curr) {
if (*curr == current_match) {
++count;
} else if (*curr == '*') {
++stars;
} else {
break;
}
}
rewritten_pattern_.push_back(Node(current_match, count - stars, stars > 0));
}
}
// We could do this iteratively and avoid a stack overflow, but the recursion solution is
// a lot easier to write, so it's good enough for SO :)
bool matches(const string& value) const {
return matches_internal(value.begin(), value.end(), rewritten_pattern_.begin(), rewritten_pattern_.end());
}
private:
static bool matches_internal(string::const_iterator value_curr,
string::const_iterator value_end,
vector<Node>::const_iterator pattern_curr,
vector<Node>::const_iterator pattern_end) {
for (; pattern_curr != pattern_end; ++pattern_curr) {
// For each pattern Node, we first verify that the required count of letters is there,
// then we handle the repeats, if specified. After this section, value_curr should
// be advanced past the required elements of the Node.
if (distance(value_curr, value_end) < pattern_curr->count) return false;
string::const_iterator current_pattern_count_end = value_curr;
advance(current_pattern_count_end, pattern_curr->count);
if (pattern_curr->value == '.') {
value_curr = current_pattern_count_end;
} else {
for (; value_curr != current_pattern_count_end; ++value_curr) {
if (*value_curr != pattern_curr->value) {
return false;
}
}
}
// We've handled the required characters, now handle the repeats, if any:
if (pattern_curr->repeats) {
if (pattern_curr->value == '.') {
// Here's the tricky case that will have to involve some backtracking. We aren't sure
// how much of the string the .* should consume, we have to try all potential positions
// and only match if any position matches. Since most regex impls are greedy
// by default, we'll start potentially matching the whole string and move our way backward.
++pattern_curr;
// The candidate positions include value_curr itself, so that .* can also match zero characters.
for (string::const_iterator wildcard_match_end = value_end; ; --wildcard_match_end) {
if (matches_internal(wildcard_match_end, value_end, pattern_curr, pattern_end)) {
return true;
}
if (wildcard_match_end == value_curr) {
break;
}
}
return false;
} else {
// If this isn't a wildcard, we can just consume all of the same value.
for (; value_curr != value_end && *value_curr == pattern_curr->value; ++value_curr) {}
}
}
}
// After all the patterns are consumed, we only match if we have consumed the value also.
return value_curr == value_end;
}
static bool both_are_repeats(char i, char j) {
return i == '*' && j == '*';
}
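// Note: to compile, this struct has to be declared before its first use in matches_internal.
struct Node {
Node(char value, int count, bool repeats) : value(value), count(count), repeats(repeats) {}
char value;
int count;
bool repeats;
};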
vector<Node> rewritten_pattern_;
};

Related

How to compare String to list of items in c

I saw another post about similar code, but it only compared the string to one other string. I was wondering if this works, or if there's a simpler, more beginner-friendly way I should learn. Thank you.
#include <cs50.h>
#include <stdio.h>
int main(void)
{
string user_imput = get_string("Fruit or Vegetable?, ");
if (strcmp(user_imput, "apple"|| "blueberries" || "cherries" || "bananas" ||"grapes" || "oranges" || "watermelon" ||"lemons" == 0));
{
printf("Fuit!, %s\n", user_imput);
}
else if (strcmp(user_imput, "potatoes" || "tomatoe" || "onions" || "carrot" || "bellpepper" || "lettuce" || "cucumbers" || "broccoli" == 0));
{
printf("Vegtable! %s\n", user_imput);
}
else
{
printf("NA");
}
}
An easy way to do this is to create a function which searches through an array containing your strings. You then loop through comparing each one individually. If found, we return a truthy value right away. If the loop runs its course and finds nothing, we return a falsy value afterwards.
int is_fruit(string food) {
    static const string fruits[] = {
        "apple", "blueberries", "cherries" /* ... */
    };

    for (size_t i = 0; i < (sizeof fruits / sizeof fruits[0]); i++)
        if (strcmp(fruits[i], food) == 0)
            return 1;

    return 0;
}
Then your if ... else looks something like:
if (is_fruit(user_input)) {
/* ... */
} else if (is_vegetable(user_input)) {
/* ... */
} else {
/* ... */
}
A good exercise would be to complete the function above (fill out the array), and write a similar one for vegetables. An even better exercise would be to write a more generic function that takes a string and any array of strings, and decides whether the array contains the string. Its function prototype would look like this:
int string_array_contains(const string array[], size_t length, const string s);
It would work very similarly to the function above, except you would have to pass in the length of the array, since sizeof cannot be used to determine the length of the array argument.
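One possible way to fill in that generic function is sketched below. To keep the example free of the CS50 header it spells the parameters as const char * rather than string (CS50's string is a typedef for char *), which is an assumption on my part and slightly different from the prototype above.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if any of the first `length` entries of `array` equals `s`, else 0.
 * Sketch only: parameter types use const char * instead of CS50's string. */
int string_array_contains(const char *const array[], size_t length, const char *s)
{
    for (size_t i = 0; i < length; i++)
        if (strcmp(array[i], s) == 0)
            return 1;   /* found: stop searching right away */

    return 0;           /* loop ran its course without a match */
}
```

At the call site, where the array is still an array and not a pointer, the length can be computed with sizeof fruits / sizeof fruits[0] and passed in as the second argument.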
If functions and arrays are too advanced for you at the moment, then simply know that you must compare each string individually using strcmp. Writing this by hand leads to very long, hard to maintain code:
if (strcmp(user_input, "apple") == 0 || strcmp(user_input, "blueberries") == 0 /* ... */) {
/* ... */
}

Waiting for character in string

I am currently working on a project that will be used to test whether an instrument is within tolerance or not. My test equipment will put the DUT (Device Under Test) into a "Test Mode" where it will repeatedly send a string of data every 200ms. I want to receive that data, check it is within tolerance and give it a pass or fail.
My code so far (I've edited a few things out like .h files and some work related bits!):
void GetData();
void CheckData();
char Data[100];
int deviceId;
float a;
float b;
float c;
void ParseString(const char* stringValue)
{
char* token = NULL;
int tokenPlace = 0;
token = strtok((char *) stringValue, ",");
while (token != NULL) {
switch (tokenPlace) {
case 0:
deviceId = atoi(token);
break;
case 1:
a= ((float)atoi(token)) / 10.0f;
break;
case 2:
b= ((float)atoi(token)) / 100.0f;
break;
case 3:
c= ((float)atoi(token)) / 10.0f;
break;
}
tokenPlace++;
token = strtok(NULL, ",");
}
}
void GetData()
{
int x = UART.scanf("%s,",Data);
ParseString(Data);
if (x !=0) {
UART.printf("Device ID = %i\n\r", deviceId);
UART.printf("a= %.1f\n\r", a);
UART.printf("s= %.2f\n\r", b);
UART.printf("c= %.1f\n\n\r", c);
}
if (deviceId <= 2) {
CheckData();
} else {
pc.printf("Device ID not recognised\n\n\r");
}
}
void CheckData()
{
if (a >= 49.9f && a <= 50.1f) {
pc.printf("a Pass\n\r");
} else {
pc.printf("a Fail\n\r");
}
if (b >= 2.08f && b <= 2.12f) {
pc.printf("b Pass\n\r");
} else {
pc.printf("b Fail\n\r");
}
if (c >= 20.0f && c <= 25.0f) {
pc.printf("c Pass\n\n\r");
} else {
pc.printf("c Fail\n\n\r");
}
if (deviceId == 0) {
(routine1);
} else if (deviceId == 1) {
(routine2);
} else if (deviceId == 2) {
(Routine3);
}
}
int main()
{
while(1) {
if(START == 0) {
wait(0.1);
GetData();
}
}
}
And this works absolutely fine. I am only printing the results to a serial terminal so I can check the data is correct to make sure it is passing and failing correctly.
My issue is that every now and then the START button happens to be pressed while the string is being sent and the data can be corrupt, so the deviceId check fails and it will say not recognised. This means I then have to press the start button again and have another go. At the moment, it's a rare occurrence but I'd like to get rid of it if possible. I have tried adding a special character at the beginning of the string but this again gets missed sometimes.
Ideally, when the start button is pressed, I would like it to wait for this special character so it knows it is at the beginning of the string, then the data would be read correctly, but I am unsure how to go about it.
I have been unsuccessful in my attempts so far but I have a feeling I am overthinking it and there is a nice easy way to do it. Probably been staring at it too long now!
My microcontroller is STM32F103RB and I am using the STM Nucleo with the mBed IDE as it's easy and convenient to test the code while I work on it.
You can use ParseString to return a status indicating whether a complete string is read or not.
int ParseString(const char* stringValue)
{
/* ... your original code ... */
/* String is complete if 4 tokens are read */
return (tokenPlace == 4);
}
Then in GetData use the ParseString return value to determine whether to skip the string or not.
void GetData()
{
int x = UART.scanf("%s,",Data);
int result = ParseString(Data);
if (!result) {
/* Did not get complete string - just skip processing */
return;
}
/* ... the rest of your original code ... */
}
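The question also asks how to wait for a special start character before reading. One way to sketch that, independent of any particular UART API, is a small state machine that is fed one received byte at a time and only starts collecting after it has seen the start marker. Everything named here (FrameReader, frame_reader_init, frame_reader_feed, and the choice of '$' and '\n' as markers) is hypothetical and not part of the original code or of mbed.

```c
#include <stddef.h>

#define FRAME_MAX   100    /* same size as the Data buffer in the question */
#define FRAME_START '$'    /* assumed start-of-string marker */
#define FRAME_END   '\n'   /* assumed end-of-string marker */

typedef struct {
    char   buf[FRAME_MAX]; /* payload between the two markers */
    size_t len;            /* number of payload bytes stored so far */
    int    in_frame;       /* nonzero once the start marker has been seen */
} FrameReader;

void frame_reader_init(FrameReader *r)
{
    r->len = 0;
    r->in_frame = 0;
}

/* Feed one received byte. Returns 1 when r->buf holds a complete,
 * NUL-terminated payload, and 0 while more bytes are needed. */
int frame_reader_feed(FrameReader *r, char byte)
{
    if (byte == FRAME_START) {      /* (re)start: drop any partial data */
        r->len = 0;
        r->in_frame = 1;
        return 0;
    }
    if (!r->in_frame)
        return 0;                   /* still waiting for the start marker */
    if (byte == FRAME_END) {
        r->buf[r->len] = '\0';
        r->in_frame = 0;
        return 1;
    }
    if (r->len + 1 >= FRAME_MAX) {  /* too long: discard and resynchronise */
        r->in_frame = 0;
        return 0;
    }
    r->buf[r->len++] = byte;
    return 0;
}
```

On the device, each byte read from the UART would be passed to frame_reader_feed, and ParseString would be called on the buffer only once it returns 1. Bytes that arrive before the start marker, such as the tail of a string that was already in flight when START was pressed, are ignored.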

Unexpected behavior in C

This is the question I'm working on : http://www.geeksforgeeks.org/recursively-remove-adjacent-duplicates-given-string/
Here's my code in Java for one pass :
/*If a character isn't repeating, copy it to str[j].
* Find start and end indices of repeating characters. Recursively call again
* And starting position now should be end+1. Pass j and starting position */
public class removeDuplicates {
public static void main(String[] args)
{
char[] str = {'c','c'};
removeDups(str,0,0,0);
System.out.println(str);
}
public static void removeDups(char[] str,int j, int start,int flag)
{
/*Check if start character is repeating or not. If yes , then loop till you find
* another character. Pass that characters index(new start) into a recursive call*/
if(start == str.length-1)
{
if(flag!=1)
{
str[j] = str[start];
j++;
}
if(j<=str.length-1)
{
str[j] = '0';
}
}
while(start<str.length-1 && str[start]!='0')
{
if(str[start+1]!=str[start])
{
str[j] = str[start];
start++;
j++;
if(start==str.length-1) {removeDups(str,j,start,flag);}
}
else
{
char ref = str[start];
while(str[start]==ref)
{
if(start<str.length-1)
{
start++;
}
else
{
flag =1;
break;
}
}
removeDups(str,j,start,flag);
return;
}
}
}
}
This works as expected. Here I'm just trying to use a 0 instead of \0 character as in C. Now when I translate the code to C
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
void removeDups(char *str,int j, int start,int flag)
{
/*Check if start character is repeating or not. If yes , then loop till you find
* another character. Pass that characters index(new start) into a recursive call*/
if(start == strlen(str)-1)
{
if(flag!=1)
{
str[j] = str[start];
j++;
}
if(j<=strlen(str)-1)
{
str[j] = '\0';
}
}
while(start<strlen(str)-1 && str[start]!='0')
{
if(str[start+1]!=str[start])
{
str[j] = str[start];
start++;
j++;
if(start==strlen(str)-1) {removeDups(str,j,start,flag);}
}
else
{
char ref = str[start];
while(str[start]==ref)
{
if(start<strlen(str)-1)
{
start++;
}
else
{
flag =1;
break;
}
}
removeDups(str,j,start,flag);
return;
}
}
}
int main()
{
char str[] = "abcddcba";
removeDups(str, 0, 0, 0);
for(int i=0;str[i]!='\0';i++)
{
printf("%c",str[i]);
}
printf("\n");
}
The above code gives different results compared to the Java code. It's virtually identical, except that I'm using strlen() instead of str.length (as in Java).
The interesting part is : if I change the portion to
if(j<=strlen(str)-1)
{
str[j] = '\0';
return;
}
it works perfectly. I've just added a return statement to the if statement.
Why is this happening? Why does identical code produce different results in C and Java?
You added a return statement, so all the code below that return no longer runs for that call.
Also, you may want to understand what \0 is and how it differs from 0.
Here's a link:
What does the \0 symbol mean in a C string?
In C, assigning a character in a string to '\0' changes the length, so strlen() will return a different result after that. In your Java code, you're using an array, and an array length never changes. You're setting the character to '0' instead of '\0', which are two different things, but even if you did set it to '\0', it still wouldn't change the length. I haven't examined your entire code, but this is one obvious thing that would cause different results.
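The difference described above can be seen in a tiny sketch. The two helper names are made up for this illustration; both overwrite one character and then report what strlen() returns.

```c
#include <stddef.h>
#include <string.h>

/* Write a terminator at s[pos] and return the length strlen() reports afterwards.
 * The string now ends at pos, so every later strlen(s) call sees the shorter length. */
size_t truncate_at(char *s, size_t pos)
{
    s[pos] = '\0';
    return strlen(s);
}

/* Write the digit character '0' at s[pos]. This is an ordinary character,
 * so the length of the string is unchanged. */
size_t overwrite_with_zero_digit(char *s, size_t pos)
{
    s[pos] = '0';
    return strlen(s);
}
```

One related point that is easy to miss: strlen() returns the unsigned type size_t, so an expression like strlen(str)-1 wraps around to a very large value when the string is empty instead of becoming -1.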

Removing single dot path names in URL in C

I'm making a function in an Apache module which is supposed to fix URLs that are thrown at it. Currently I'm trying to remove single dot path names.
For example, if my URL is:
http://example.com/1/./2/./3/./4.php
Then I want the URL to be:
http://example.com/1/2/3/4.php
However I'm stuck with the logic. I'm using pointers in an effort to make this function run as fast as possible. I'm confused at the logic I should apply at the lines with //? added to the end of them.
Can someone give me advice on how to proceed? Even if it's some hidden manual online? I searched Bing and Google for answers with no success.
static long fixurl(char *u){
char u1[10000];
char *u11=u1,*uu1=u;
long ct=0,fx=0;
while (*uu1){
*u11=*uu1;
if (*uu1=='/'){
ct++;
if (ct >=2){
uu1++;
break;
}
} else {
ct=0;
}
}
while (*uu1){
if (*uu1!='/') { //?
if (*uu1!='.') {
*u11=*uu1;
u11++;
} //?
} //?
uu1++;
}
*u11='\0';
strcpy(u,u1);
return fx;
}
You forget to look ahead one character here:
if (*uu1!='/') { //?
if (*uu1!='.') {
Here you are checking the same character twice (against a 'not', so it could have some use, but your question marks indicate you are not sure what to do there and further on).
Note that you actually need to look ahead two characters. If you encounter a slash, test the next character for a . and the one after that for another /.
Rather than trying to fix your code (what is fx, the returned value, supposed to be?), I'd rewrite it from scratch to copy from source to dest and skip the offending sections. When a /./ is found, the code copies nothing and goes around the loop again, which makes sure that a sequence /1/././2 gets cleansed correctly to just /1/2: the second slash needs a chance to be checked again, so I just throw it back into the loop.
void fixurl (char *theUrl)
{
    char *source, *dest;

    source = dest = theUrl;

    while (*source)
    {
        if (source[0] == '/' && source[1] == '.' && source[2] == '/')
        {
            source += 2; /* effectively, 'try again on the next slash' */
        } else
        {
            *dest = *source;
            source++;
            dest++;
        }
    }
    *dest = 0;
}
(Afterthought:)
Interestingly, adding proper support for removal of /../ is fairly trivial. If you test for that sequence, you should search backwards for the last / before it and reset dest to that position. You'll want to make sure the path is still valid, though.
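A rough sketch of that afterthought, under some stated assumptions: the function name collapse_dot_segments is made up, and the argument is assumed to be only the path part of the URL (starting with '/'), not the scheme or host, so that backing up can never run into "http://". A /../ at the very start has nothing to remove and is simply dropped. A trailing "/." or "/.." without a following slash is left alone.

```c
/* Collapse "/./" and "/../" segments in place.
 * Assumption: `path` is just the path component and begins with '/'. */
void collapse_dot_segments(char *path)
{
    char *source = path, *dest = path;

    while (*source) {
        if (source[0] == '/' && source[1] == '.' && source[2] == '/') {
            source += 2;    /* skip "/." and try again on the next slash */
        } else if (source[0] == '/' && source[1] == '.' &&
                   source[2] == '.' && source[3] == '/') {
            source += 3;    /* skip "/.." and try again on the next slash */
            /* Back dest up to the last '/' already written, which drops the
             * previous segment; stop at the start so the path stays valid. */
            while (dest > path && *--dest != '/')
                ;
        } else {
            *dest++ = *source++;
        }
    }
    *dest = '\0';
}
```

Because dest never moves ahead of source, the rewrite is safe to do in place and needs no temporary buffer.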
This code is untested. In short, it iterates over the string looking for '/'. When it finds one, it checks the next two characters, and if they are "./" it skips ahead so that the dot segment is never copied.
static long fixurl(char *u){
    char u1[10000];
    long currentIndex = 0;
    const char *src = u; /* keep u pointing at the caller's buffer for the final strcpy */
    while (*src != '\0' && currentIndex < (long)sizeof(u1) - 1) {
        if (src[0] == '/' && src[1] == '.' && src[2] == '/') {
            src += 2; /* skip "/." and re-examine the following slash */
            continue;
        }
        u1[currentIndex++] = *src++;
    }
    u1[currentIndex] = '\0';
    strcpy(u, u1);
    return currentIndex; /* length of the cleaned URL */
}
Here is a version of the code that works.
Note that it will remove every '.' that follows a '/'.
However, it does not check for extraneous '/' characters being inserted into the output, as the OP's posted code does not make that check.
Notice the proper formatting of the for() statement.
Notice the use of meaningful names, removal of code clutter, inclusion of a few key comments, etc.
Notice that the literal characters are placed on the left side of each comparison, so writing a '=' when it should be '==' is caught by the compiler.
#include <string.h>

long fixurl( char * );

long fixurl(char *rawURL)
{
    char cookedURL[10000] = {'\0'}; // assure new string is terminated
    int currentIndex = 0;

    if( '\0' == rawURL[0] )
    {
        return 0; // nothing to do for an empty string
    }

    cookedURL[currentIndex] = rawURL[0];

    // walk a separate pointer so rawURL still points at the caller's buffer
    for ( const char *src = rawURL + 1; *src; src++)
    {
        // if prior saved char was / and current char is .
        // then skip current char
        if( ( '/' != cookedURL[currentIndex] )
          ||
            ( '.' != *src ))
        {
            // copy input char to out buffer
            currentIndex++;
            cookedURL[currentIndex] = *src;
        }
    } // end for

    // copy modified URL back to caller's buffer
    strcpy(rawURL, cookedURL);
    return currentIndex+1; // number of characters in modified buffer
} // end function: fixurl

parsing with recursion - brackets

Could someone give me a hint on this problem:
An expression is correct only if it contains parentheses and braces properly closed and no other character, not even a space. For example, ()({}()({})) is a correct expression, whereas ({)} and {}({})) are not correct expressions. An empty expression (which does not contain any character) is correct.
Given a string expression, determine whether the expression is correct and, if it is, determine its maximum level of nesting. The maximum level of nesting is the largest number of brackets nested inside one another.
Examples
{}({}){{(({}))}}
answer : 5
{}({}))
answer : -1 (because the expression is incorrect)
That's what I've done so far.
#include <stdio.h>
#include <stdlib.h>
FILE *fi, *fo;
int first, er;
void X();
void Y();
void S() {
X();
Y();
}
void X() {
if(first=='{') {
first=fgetc(fi);
X();
if(first=='}')
first=fgetc(fi);
else
er=1;
S();
}
}
void Y() {
if(first=='(') {
first=fgetc(fi);
Y();
if(first==')')
first=fgetc(fi);
else
er=1;
S();
}
}
int main()
{
fi = fopen("brackets.in","r");
fo = fopen("brackets.out","w");
first=fgetc(fi);
S();
if(first!='\n')
er=-1;
fprintf(fo,"%d",er);
fclose(fi);
fclose(fo);
return 0;
}
First off, it helps to think of your problem as a formal grammar.
S = The Language you are testing for
S->
NUL // Empty
SS // S followed by itself.
[ S ] // Case 1
( S ) // Case 2
{ S } // Case 3
Since this grammar only has one symbol (S), you only need one parsing method.
The following code is incomplete but hopefully it gets the idea across.
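#include <stdio.h>

int curr_char;   // one character of lookahead (int so it can hold EOF)

int parse_s(void);

int main (void)
{
    curr_char = getchar();
    int result = parse_s();
    // The whole input must have been consumed for the expression to be valid.
    if (curr_char != '\n' && curr_char != EOF)
        result = -1;
    printf("%d\n", result);
    return 0;
}

// Parse the S pattern off input. When this function returns, curr_char holds the character AFTER S.
// Returns the maximum nesting depth or -1 on fail.
int parse_s(void)
{
    int max_count = 0;                        // The NUL (empty) case has depth 0
    while (1)                                 // The SS case: keep parsing adjacent groups
    {
        int closer;
        switch (curr_char)
        {
            case '[': closer = ']'; break;    // Case 1
            case '(': closer = ')'; break;    // Case 2
            case '{': closer = '}'; break;    // Case 3
            default:
                // Not an opener, so S ends here. The caller decides whether
                // curr_char is the closer it expects or an error.
                return max_count;
        }
        curr_char = getchar();                // Advance past the opener
        int count = parse_s();                // S
        if (count == -1) return -1;           // The S must be valid
        if (curr_char != closer) return -1;   // Must see the matching closer
        curr_char = getchar();                // Advance past the closer
        if (count + 1 > max_count)            // This group is 1 nest greater than its contained S
            max_count = count + 1;
    }
}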
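As an alternative to the recursive parser, the same check can be sketched iteratively with an explicit stack of expected closing characters. This is a different technique from the one in the answer above, and the names max_nesting and MAX_DEPTH are assumptions for illustration. It works on a string already read into memory, handles only the two bracket kinds from the problem statement, and returns -1 for an incorrect expression.

```c
#include <stddef.h>

#define MAX_DEPTH 1024   /* assumed upper bound on nesting for this sketch */

/* Returns the maximum nesting depth of expr, or -1 if it is not a correct
 * expression (mismatched brackets, stray characters, or nesting too deep). */
int max_nesting(const char *expr)
{
    char stack[MAX_DEPTH];   /* expected closers, innermost on top */
    size_t depth = 0;
    int max_depth = 0;

    for (; *expr != '\0'; expr++) {
        switch (*expr) {
        case '(':
        case '{':
            if (depth == MAX_DEPTH)
                return -1;
            /* remember which closer must match this opener */
            stack[depth++] = (*expr == '(') ? ')' : '}';
            if ((int)depth > max_depth)
                max_depth = (int)depth;
            break;
        case ')':
        case '}':
            /* a closer with nothing open, or the wrong kind of closer */
            if (depth == 0 || stack[--depth] != *expr)
                return -1;
            break;
        default:
            return -1;       /* no other character is allowed, not even a space */
        }
    }
    return depth == 0 ? max_depth : -1;   /* anything left open is an error */
}
```

The stack depth at any point is the current nesting level, so tracking its maximum gives the answer without any recursion.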

Resources