Count number of matches using regex.h in C - c

I'm using the POSIX regular expressions regex.h in C to count the number of appearances of a phrase in an English-language text fragment.
But the return value of regexec(...) only tells whether a match was found or not. So I tried to use nmatch and matchptr to find the distinct appearances, but when I printed the matches from matchptr, I only got the index of the first appearance of the phrase in my text.
Here is my code:
#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#define MAX_MATCHES 20 //The maximum number of matches allowed in a single string
void match(regex_t *pexp, char *sz) {
regmatch_t matches[MAX_MATCHES];
if (regexec(pexp, sz, MAX_MATCHES, matches, 0) == 0) {
for(int i = 0; i < MAX_MATCHES; i++)
printf("\"%s\" matches characters %d - %d\n", sz, matches[i].rm_so, matches[i].rm_eo);
}
else {
printf("\"%s\" does not match\n", sz);
}
}
int main(int argc, char* argv[]) {
int rv;
regex_t exp;
rv = regcomp(&exp, "(the)", REG_EXTENDED | REG_ICASE);
if (rv != 0) {
printf("regcomp failed\n");
}
match(&exp, "the cat is in the bathroom.");
regfree(&exp);
return 0;
}
How can I make this code report both distinct matches of the regular expression (the) in the string the cat is in the bathroom?

You've misunderstood the meaning of pmatch. It is not used for getting repeated pattern matches. It is used to get the location of the one match and its possible subgroups. As the Linux manual page for regcomp(3) says:
The offsets of the subexpression starting at the ith open
parenthesis are stored in pmatch[i]. The entire regular expression's match addresses are stored in
pmatch[0]. (Note that to return the offsets of N subexpression matches, nmatch must be at least N+1.)
Any unused structure elements will contain the value -1.
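If you have the regular expression this (\w+) costs (\d+) USD, there are 2 capturing groups in parentheses, (\w+) and (\d+). If nmatch is set to at least 3, pmatch[0] contains the start and end indices of the whole match, pmatch[1] the start and end of the (\w+) group, and pmatch[2] those of the (\d+) group. (The \w and \d shorthands are used here only for readability; POSIX extended regular expressions do not define them, so portable code would write bracket expressions such as [[:alnum:]_]+ and [[:digit:]]+ instead.)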
The following code should print the ranges of consecutive matches, if any, or the string "<the input string>" does not contain a match if the pattern never matches.
It is carefully constructed so that it also works for a regular expression that can match zero characters (an empty regular expression, or say the regular expression #?, matches at each character position, including after the last character; 28 matches would be reported for the input the cat is in the bathroom.)
#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>
void match(regex_t *pexp, char *sz) {
// we just need the whole string match in this example
regmatch_t whole_match;
// we store the eflags in a variable, so that we can make
// ^ match the first time, but not for subsequent regexecs
int eflags = 0;
int match = 0;
size_t offset = 0;
size_t length = strlen(sz);
while (regexec(pexp, sz + offset, 1, &whole_match, eflags) == 0) {
// do not let ^ match again.
eflags = REG_NOTBOL;
match = 1;
printf("range %zu - %zu matches\n",
offset + (size_t)whole_match.rm_so,
offset + (size_t)whole_match.rm_eo);
// increase the starting offset
offset += whole_match.rm_eo;
// a match can be a zero-length match, we must not fail
// to advance the pointer, or we'd have an infinite loop!
if (whole_match.rm_so == whole_match.rm_eo) {
offset += 1;
}
// break the loop if we've consumed all characters. Note
// that we run once for terminating null, to let
// a zero-length match occur at the end of the string.
if (offset > length) {
break;
}
}
if (! match) {
printf("\"%s\" does not contain a match\n", sz);
}
}
int main(int argc, char* argv[]) {
int rv;
regex_t exp;
rv = regcomp(&exp, "(the)", REG_EXTENDED | REG_ICASE);
if (rv != 0) {
printf("regcomp failed\n");
return 1; // exp is not usable if compilation failed
}
match(&exp, "the cat is in the bathroom.");
regfree(&exp);
return 0;
}
P.S. The parentheses in your regex (the) are unnecessary in this case; you could just write the. Your initial confusion about getting 2 matches at the same position arose because you got one match for the whole expression and one submatch for the group (the); without the parentheses, your code would have printed the location of the first match only once.

Related

list conversion in C

I am trying to put command-line arguments from the user into an array, but I am unsure how to approach it.
For example say I ran my program like this.
./program 1,2,3,4,5
How would I store 1 2 3 4 5 without the commas, and allow it to be passed to other functions to be used. I'm sure this has to do with using argv.
PS: NO space-separated, I want the numbers to parse into integers, I have an array of 200, and I want these numbers to be stored in the array as, arr[0] = 1, arr[1] = 2....
store 1 2 3 4 5 without the commas, and allow it to be passed to other functions to be used.
PS: NO space-separated, I want the numbers to parse into integers
Space or comma-separated doesn't matter. Arguments always come in as strings. You will have to do the work to turn them into integers using atoi (Ascii-TO-Integer).
Using spaces between arguments is the normal convention: ./program 1 2 3 4 5. They come in already separated in argv.
Loop through argv (skipping argv[0], the program name) and run them through atoi.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
for(int i = 1; i < argc; i++) {
int num = atoi(argv[i]);
printf("%d: %d\n", i, num);
}
}
Using commas is going to make that harder. You first have to split the string using the kind of weird strtok (STRing TOKenizer). Then again call atoi on the resulting values.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[]) {
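if (argc < 2) { // nothing to split without an argument
fprintf(stderr, "usage: %s n1,n2,...\n", argv[0]);
return 1;
}
char *token = strtok(argv[1], ",");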
while(token) {
int num = atoi(token);
printf("%d\n", num);
token = strtok(NULL, ",");
}
}
This approach is also more fragile than taking them as individual arguments. If the user types ./program 1, 2, 3, 4, 5 only 1 will be read.
One of the main disadvantages to using atoi() is it provides no check on the string it is processing and will happily accept atoi ("my-cow"); and silently fail returning 0 without any indication of a problem. While a bit more involved, using strtol() allows you to determine what failed, and then recover. This can be as simple or as in-depth a recovery as your design calls for.
As mentioned in the comment, strtol() was designed to work through a string, converting sets of digits found in the string to a numeric value. On each call it will update the endptr parameter to point to the next character in the string after the last digit converted (to each ',' in your case -- or the nul-terminating character at the end). man 3 strtol provides the details.
Since strtol() updates endptr to the character after the last digit converted, you check if nptr == endptr to catch the error when no digits were converted. You check errno for a numeric conversion error such as overflow. Lastly, since the return type is long you need to check if the value returned is within the range of an int before assigning to your int array.
Putting it all together with a very minimal bit of error handling, you could do something like:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#define NELEM 200 /* if you need a constant, #define one (or more) */
int main (int argc, char **argv) {
int arr[NELEM] = {0}, ndx = 0; /* array and index */
char *nptr = argv[1], *endptr = nptr; /* nptr and endptr */
if (argc < 2) { /* if no argument, handle error */
fputs ("error: no argument provided.\n", stderr);
return 1;
}
else if (argc > 2) { /* warn on more than 2 arguments */
fputs ("warning: more than one argument provided.\n", stdout);
}
while (ndx < NELEM) { /* loop until all ints processed or arr full */
int error = 0; /* flag indicating error occurred */
long tmp = 0; /* temp var to hold strtol return */
char *onerr = NULL; /* pointer to next comma after error */
errno = 0; /* reset errno */
tmp = strtol (nptr, &endptr, 0); /* attempt conversion to long */
if (nptr == endptr) { /* no digits converted */
fputs ("error: no digits converted.\n", stderr);
error = 1;
onerr = strchr (endptr, ',');
}
else if (errno) { /* overflow in conversion */
perror ("strtol conversion error");
error = 1;
onerr = strchr (endptr, ',');
}
else if (tmp < INT_MIN || INT_MAX < tmp) { /* check in range of int */
fputs ("error: value outside range of int.\n", stderr);
error = 1;
onerr = strchr (endptr, ',');
}
if (!error) { /* error flag not set */
arr[ndx++] = tmp; /* assign integer to arr, advance index */
}
else if (onerr) { /* found next ',' update endptr to next ',' */
endptr = onerr;
}
else { /* no next ',' after error, break */
break;
}
/* if at end of string - done, break loop */
if (!*endptr) {
break;
}
nptr = endptr + 1; /* update nptr to 1-past ',' */
}
for (int i = 0; i < ndx; i++) { /* output array content */
printf (" %d", arr[i]);
}
putchar ('\n'); /* tidy up with newline */
}
Example Use/Output
This will handle your normal case, e.g.
$ ./bin/argv1csvints 1,2,3,4,5
1 2 3 4 5
It will warn on bad arguments in list while saving all good arguments in your array:
$ ./bin/argv1csvints 1,my-cow,3,my-cat,5
error: no digits converted.
error: no digits converted.
1 3 5
As well as handling completely bad input:
$ ./bin/argv1csvints my-cow
error: no digits converted.
Or no argument at all:
$ ./bin/argv1csvints
error: no argument provided.
Or more than the expected 1 argument:
$ ./bin/argv1csvints 1,2,3,4,5 6,7,8
warning: more than one argument provided.
1 2 3 4 5
The point to be made is that with a little extra code, you can make your argument-parsing routine as robust as need be. While your use of a single argument with comma-separated values is unusual, it is doable. You can manually tokenize (split) the string on the commas with strtok() (or strchr(), or a combination of strspn() and strcspn()); loop with sscanf() using something like the "%d%n" format string to get a minimal succeed/fail indication along with the offset of the next number; or use strtol() and take advantage of its error reporting. It's up to you.
Look things over and let me know if you have questions.
This is how I'd deal with your requirement using strtol(). This does not damage the input string, unlike solutions using strtok(). It also handles overflows and underflows correctly, unlike solutions using atoi() or its relatives. The code assumes you want to store an array of type long; if you want to use int, you can add a test to see whether the converted value is larger than INT_MAX or less than INT_MIN, and report an appropriate error if it is not a valid int value.
Note that handling errors from strtol() is a tricky business, not least because every return value (from LONG_MIN up to LONG_MAX) is also a valid result. See also Correct usage of strtol(). This code does not allow spaces before a comma; it permits them after the comma (so you could run ./csa43 '1, 2, -3, 4, 5' and it would work). It allows leading spaces, but not trailing spaces. These issues could be fixed with more work, probably mostly in the read_val() function. It may be that the validation work in the main loop should be delegated to read_val(), which would give a better separation of duties. OTOH, what's here works within limits. It would be feasible to allow trailing spaces, or spaces before commas, if that's what you choose. It would be equally feasible to prohibit leading spaces and spaces after commas, if that's what you choose.
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
static int read_val(const char *str, char **eov, long *value)
{
errno = 0;
char *eon;
if (*str == '\0')
return -1;
long val = strtol(str, &eon, 0);
if (eon == str || (*eon != '\0' && *eon != ',') ||
((val == LONG_MIN || val == LONG_MAX) && errno == ERANGE))
{
fprintf(stderr, "Could not convert '%s' to an integer "
"(the leftover string is '%s')\n", str, eon);
return -1;
}
*value = val;
*eov = eon;
return 0;
}
int main(int argc, char **argv)
{
if (argc != 2)
{
fprintf(stderr, "Usage: %s n1,n2,n3,...\n", argv[0]);
exit(EXIT_FAILURE);
}
enum { NUM_ARRAY = 200 };
long array[NUM_ARRAY];
size_t nvals = 0;
char *str = argv[1];
char *eon;
long val;
while (read_val(str, &eon, &val) == 0 && nvals < NUM_ARRAY)
{
array[nvals++] = val;
str = eon;
if (str[0] == ',' && str[1] == '\0')
{
fprintf(stderr, "%s: trailing comma in number string\n", argv[1]);
exit(EXIT_FAILURE);
}
else if (str[0] == ',')
str++;
}
for (size_t i = 0; i < nvals; i++)
printf("[%zu] = %ld\n", i, array[i]);
return 0;
}
Output (program csa43 compiled from csa43.c):
$ csa43 1,2,3,4,5
[0] = 1
[1] = 2
[2] = 3
[3] = 4
[4] = 5
$

C if statement, optimal way to check for special characters and letters

Hi folks, thanks in advance for any help. I'm doing the CS50 course and I'm at the very beginning of programming.
I'm trying to check whether the string from the main function parameter string argv[] is indeed a number. I searched for multiple ways to do this.
In another topic, How can I check if a string has special characters in C++ effectively?, I found this solution posted by the user Jerry Coffin:
char junk;
if (sscanf(str, "%*[A-Za-z0-9_]%c", &junk))
/* it has at least one "special" character */
else
/* no special characters */
It seems to me it may work for what I'm trying to do, but I'm not familiar with the sscanf function and I'm having a hard time integrating and adapting it to my code. I came this far, but I can't understand the logic of my mistake:
#include <cs50.h>
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
int numCheck(string[]);
int main(int argc, string argv[]) {
//Function to check for user "cooperation"
int key = numCheck(argv);
}
int numCheck(string input[]) {
int i = 0;
char junk;
bool usrCooperation = true;
//check for user "cooperation" check that key isn't a letter or special sign
while (input[i] != NULL) {
if (sscanf(*input, "%*[A-Za-z_]%c", &junk)) {
printf("test fail");
usrCooperation = false;
} else {
printf("test pass");
}
i++;
}
return 0;
}
check if the string from the main function parameter string argv[] is indeed a number
A direct way to test if the string converts to an int is to use strtol(). This nicely handles "123", "-123", "+123", "1234567890123", "x", "123x", "".
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
// Returns 1 if s converts entirely to a value in int range, 0 otherwise
int numCheck(const char *s) {
char *endptr;
errno = 0; // Clear error indicator
long num = strtol(s, &endptr, 0);
if (s == endptr) return 0; // no conversion
if (*endptr) return 0; // Junk after the number
if (errno) return 0; // Overflow
if (num > INT_MAX || num < INT_MIN) return 0; // int Overflow
return 1; // Success
}
int main(int argc, char *argv[]) { // CS50's string is a typedef for char *
// Check each argument, starting with argv[1]
for (int a = 1; a < argc; a++) {
int success = numCheck(argv[a]);
printf("test %s\n", success ? "pass" : "fail");
}
}
sscanf(*input, "%*[A-Za-z_]%c", &junk) is the wrong approach for testing numerical conversion.
You pass argv to numCheck and test all strings in it; this is incorrect because argv[0] is the name of the running executable, so you should skip this argument. Note also that you should pass input[i] to sscanf(), not *input.
Furthermore, let's analyze the return value of sscanf(input[i], "%*[A-Za-z_]%c", &junk):
it returns EOF if the input string is empty,
it returns 0 if %*[A-Za-z_] fails,
it also returns 0 if the conversion %c fails after the %*[A-Za-z_] succeeds,
it returns 1 if both conversions succeed.
This test is insufficient to check for non-digits in the string and does not actually give useful information: the return value will be 0 for the string "1" and also for the string "a"...
sscanf() is very tricky, full of quirks and traps. Definitely not the right tool for pattern matching.
If the goal is to check that the strings contain only digits (at least one), use this instead, using the often overlooked standard function strspn():
#include <stdio.h>
#include <string.h>
int numCheck(char *input[]) {
int i;
int usrCooperation = 1;
//check for user "cooperation" check that key isn't a letter or special sign
for (i = 1; input[i] != NULL; i++) {
// count the number of matching character at the beginning of the string
int ndigits = strspn(input[i], "0123456789");
// check for at least 1 digit and no characters after the digits
if (ndigits > 0 && input[i][ndigits] == '\0') {
printf("test passes: %d digits\n", ndigits);
} else {
printf("test fails\n");
usrCooperation = 0;
}
}
return usrCooperation;
}
Let's try this again:
This is still your problem:
if (sscanf(*input, "%*[A-Za-z_]%c", &junk))
but not for the reason I originally said - *input is equal to input[0]. What you want to have there is
if ( sscanf( input[i], "%*[A-Za-z_]%c", &junk ) )
what you're doing is cycling through all your command line arguments in the while loop:
while( input[i] != NULL )
but you're only actually testing input[0].
So, quick primer on sscanf:
The first argument (input) is the string you're scanning. The type of this argument needs to be char * (pointer to char). The string typedef name is an alias for char *. CS50 tries to paper over the grosser parts of C string handling and I/O and the string typedef is part of that, but it's unique to the CS50 course and not a part of the language. Beware.
The second argument is the format string. %[ and %c are format specifiers and tell sscanf what you're looking for in the string. %[ specifies a set of characters called a scanset - %[A-Za-z_] means "match any sequence of upper- and lowercase letters and underscores". The * in %*[A-Za-z_] means don't assign the result of the scan to an argument. %c matches any character.
Remaining arguments are the input items you want to store, and their type must match up with the format specifier. %[ expects its corresponding argument to have type char * and be the address of an array into which the input will be stored. %c expects its corresponding argument (in this case junk) to also have type char *, but it's expecting the address of a single char object.
sscanf returns the number of items successfully read and assigned - in this case, you're expecting the return value to be either 0 or 1 (because only junk gets assigned to).
Putting it all together,
sscanf( input, "%*[A-Za-z_]%c", &junk )
will read and discard characters from input up until it either sees the string terminator or a character that is not part of the scanset. If it sees a character that is not part of the scanset (such as a digit), that character gets written to junk and sscanf returns 1, which in this context is treated as "true". If it doesn't see any characters outside of the scanset, then nothing gets written to junk and sscanf returns 0, which is treated as "false".
EDIT
So, chqrlie pointed out a big error of mine - this test won't work as intended.
If there are no non-letter and non-underscore characters in input[i], then nothing gets assigned to junk and sscanf returns 0 (nothing assigned). If input[i] starts with a letter or underscore but contains a non-letter or non-underscore character later on, that bad character will be converted and assigned to junk and sscanf will return 1.
So far so good, that's what you want to happen. But...
If input[i] starts with a non-letter or non-underscore character, then you have a matching failure and sscanf bails out, returning 0. So it will erroneously match a bad input.
Frankly, this is not a very good way to test for the presence of "bad" characters.
A potentially better way would be to use something like this:
while ( input[i] )
{
bool good = true;
/**
* Cycle through each character in input[i] and
* check to see if it's a letter or an underscore;
* if it isn't, we set good to false and break out of
* the loop.
*/
for ( char *c = input[i]; *c; c++ )
{
if ( !isalpha( (unsigned char) *c ) && *c != '_' )
{
good = false;
break;
}
}
if ( !good )
{
puts( "test fails" );
usrCooperation = 0;
}
else
{
puts( "test passes" );
}
i++; /* move on to the next argument, otherwise the loop never terminates */
}
I followed the solution by the user "chux - Reinstate Monica". Thanks, everybody, for helping me solve this problem. Here is my final program; maybe it can help another learner in the future. I decided to avoid using the non-standard library "cs50.h".
//#include <cs50.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <limits.h>
void keyCheck(int);
int numCheck(char*);
int main(int argc, char* argv[])
{
//Error code == 1;
int key = 0;
keyCheck(argc); //check that two parameters were sent to main.
key = numCheck(argv[1]); //Check for user "cooperation".
return 0;
}
//check for that main received two parameters.
void keyCheck(int key)
{
if (key != 2) //check that main argc only has two parameter. if not terminate program.
{
exit(1);
}
}
//check that the key (main parameter (argv [])) is a valid number.
int numCheck(char* input)
{
char* endptr;
errno = 0;
long num = strtol(input, &endptr, 0);
if (input == endptr) //no conversion is possible.
{
printf("Error: No conversion possible");
return 1;
}
else if (errno == ERANGE) //Input out of range
{
printf("Error: Input out of range");
return 1;
}
else if (*endptr) //Junk after numeric text
{
printf("Error: data after main parameter");
return 1;
}
else //conversion successful
{
//verify that the long int is in the integer limits.
if (num >= INT_MIN && num <= INT_MAX)
{
return num;
}
//if the main parameter is bigger than an int, terminate program
else
{
printf("Error key out of integer limits");
exit(1);
}
}
/* else
{
printf("Success: %ld", num);
return num;
} */
}

Writing a C program that removes every occurrence of a char except the last one

I'm trying to write a C program that removes all occurrences of repeating chars in a string except the last occurrence. For example, if I had the string
char word[]="Hihxiivaeiavigru";
output should be:
printf("%s",word);
hxeavigru
What I have so far:
#include <stdio.h>
#include <string.h>
int main()
{
char word[]="Hihxiiveiaigru";
for (int i=0;i<strlen(word);i++){
if (word[i+1]==word[i]);
memmove(&word[i], &word[i + 1], strlen(word) - i);
}
printf("%s",word);
return 0;
}
I am not sure what I am doing wrong.
With short strings, any algorithm will do. OP's attempt is O(n*n), as are the other working answers, including the one from @David C. Rankin that identified the shortcomings in OP's code.
But what if the string was thousands, millions in length?
Consider the following algorithm (@paulsm4):
Form a `bool` array used[CHAR_MAX - CHAR_MIN + 1] and set each false.
i,unique = n - 1;
From the end of the string (n-1 to 0) to the front:
if (character never seen yet) { // used[] look-up
array[unique] = array[i];
unique--;
}
Mark used[array[i]] as true (index from CHAR_MIN)
i--;
Shift the string "to the left" (unique - i) places
Solution is O(n)
Coding goal is too fun to just post a fully coded answer.
I would first write a function to determine whether the char ch at a given position p is the last occurrence of ch in the string word (ignoring case). Like,
bool isLast(char *word, char ch, int p) {
p++;
ch = tolower(ch);
while (word[p] != '\0') {
if (tolower(word[p]) == ch) {
return false;
}
p++;
}
return true;
}
Then you can use that to iteratively emit your desired characters like
int main() {
char *word = "Hihxiivaeiavigru";
for (int i = 0; word[i] != '\0'; i++) {
if (isLast(word, word[i], i)) {
putchar(word[i]);
}
}
putchar('\n');
}
And (for completeness) I used
#include <stdio.h>
#include <ctype.h>
#include <stdbool.h>
Outputs (as requested)
hxeavigru
Additional areas where you are currently hurting yourself.
Your for loop must NOT increment the index, e.g. for (int i=0; word[i];). This is because when you memmove() by 1, you have just incremented the indexes. That also means the value to save for last is now i - 1.
There should only be one call to strlen() in the program. You can simply subtract one from the length each time memmove() is called.
Only increment your loop counter variable when memmove() is not called.
Additionally, avoid hardcoding strings. You shouldn't have to recompile your code just to test the results of "Hihxiivaeiaigrui" instead of "Hihxiivaeiaigru". You shouldn't have to recompile just to remove all but the last 'a' instead of the 'i'. Either pass the string and character to find as arguments to your program (that's what int argc, char **argv are for), or prompt the user for input.
Putting it all together you could do (presuming word is 1023 characters or less):
#include <stdio.h>
#include <string.h>
#define MAXC 1024
int main (int argc, char **argv) {
char word[MAXC]; /* storage for word */
strcpy (word, argc > 1 ? argv[1] : "Hihxiivaeiaigru"); /* copy to word */
int find = argc > 2 ? *argv[2] : 'i', /* character to find */
last = -1; /* last index where find found */
size_t len = strlen (word); /* only compute strlen once */
printf ("%s (removing all but last %c)\n", word, find);
for (int i=0; word[i];) { /* loop over each char -- do NOT increment */
if (word[i] == find) { /* is this my character to find? */
if (last != -1) { /* if last is set */
/* overwrite last with rest of word */
memmove (&word[last], &word[last + 1], (int)len - last);
last = i - 1; /* last now i - 1 (we just moved it) */
len = len - 1;
}
else { /* last not set */
last = i; /* set it */
i++; /* increment loop counter */
}
}
else /* all other chars */
i++; /* just increment loop counter */
}
puts (word); /* output result -- no need for printf (no conversions) */
}
Example Use/Output
$ ./bin/rm_all_but_last_occurrence
Hihxiivaeiaigru (removing all but last i)
Hhxvaeaigru
What if you want to use "Hihxiivaeiaigrui"? Just pass it as the 1st argument:
$ ./bin/rm_all_but_last_occurrence Hihxiivaeiaigrui
Hihxiivaeiaigrui (removing all but last i)
Hhxvaeagrui
What if you want to use "Hihxiivaeiaigrui" and remove duplicate 'a' characters? Just pass the string to search as the 1st argument and the character to find as the second:
$ ./bin/rm_all_but_last_occurrence Hihxiivaeiaigrui a
Hihxiivaeiaigrui (removing all but last a)
Hihxiiveiaigrui
Nothing removed if only one of the characters:
$ ./bin/rm_all_but_last_occurrence Hihxiivaeiaigrui H
Hihxiivaeiaigrui (removing all but last H)
Hihxiivaeiaigrui
Let me know if you have further questions.
I'm trying to write a C program that removes all occurrences of repeating chars in a string except the last occurrence.
Process the string (or word) from the last character towards the first. Viewed from that direction, the problem becomes removing every occurrence of a character except the first one encountered. Because the kept characters accumulate at the end of the buffer, once the whole string has been processed they must be moved to the start of the string, but only if duplicates were found. The complexity of this algorithm is O(n).
Implementation:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#define INDX(x) (tolower(x) - 'a')
void remove_dups_except_last (char str[]) {
int map[26] = {0}; /* to keep track of a character processed */
size_t len = strlen (str);
char *p = str + len; /* pointer pointing to null character of input string */
size_t i = 0;
for (i = len; i != 0; --i) {
if (map[INDX(str[i - 1])] == 0) {
map[INDX(str[i - 1])] = 1;
*--p = str[i - 1];
}
}
/* copy to the front only if duplicate characters were removed */
if (p != str) {
for (i = 0; *p; ++i) {
str[i] = *p++;
}
str[i] = '\0';
}
}
int main(int argc, char* argv[])
{
if (argc != 2) {
printf ("Invalid number of arguments\n");
return -1;
}
char str[1024] = {0};
/* Assumption: the input string/word will contain characters A-Z and a-z
* only and size of input will not be more than 1023.
*
* Leaving it up to you to check the valid characters in input string/word
*/
strcpy (str, argv[1]);
printf ("Original string : %s\n", str);
remove_dups_except_last (str);
printf ("Removed duplicated characters except the last one, modified string : %s\n", str);
return 0;
}
Testcases output:
# ./a.out Hihxiivaeiavigru
Original string : Hihxiivaeiavigru
Removed duplicated characters except the last one, modified string : hxeavigru
# ./a.out aa
Original string : aa
Removed duplicated characters except the last one, modified string : a
# ./a.out a
Original string : a
Removed duplicated characters except the last one, modified string : a
# ./a.out TtYyuU
Original string : TtYyuU
Removed duplicated characters except the last one, modified string : tyU
You can iterate over each character of your string and copy it to a new string if it is not 'i', or if it is the last occurrence of 'i'.
#include <stdio.h>
#include <string.h>
int main() {
char word[]="Hihxiiveiaigru";
char newword[10000];
char* ptr = strrchr(word, 'i');
int index=0;
int index2=0;
while (index < strlen(word)) {
if (word[index]!='i' || index ==(ptr - word)) {
newword[index2]=word[index];
index2++;
}
index++;
}
newword[index2] = '\0'; // terminate the copy before printing it
printf("%s",newword);
return 0;
}

How do I check if a pattern exists in an entered string?

I have an assignment where the user enters a string and then a pattern in one function, and then has to check if the pattern exists in the string and how many times it appears and at what offset. I'm stumped and my classmates keep giving me cryptic hints. Below is my get function
int getNums()
{
printf("Please enter a number: "); //Initial printf
int count, patcount;
int torf;
char len_num[31]; //The character array for the initial entered string
char pat_num[6]; //The character array for the entered pattern after initial string
char *lenptr = len_num; //pointer to the address of the first element of len_num
char *patptr = pat_num; //pointer to the address of the first element of len_num
scanf("%s", len_num); //Where the user scans in their wanted number, which is treated as a string
printf("\n");
printf("%s\n", lenptr);
int len = stringLength(lenptr); //Checks how long string is
int valid = isValid(len_num); //Checks if string is valid
for(count=0; count<len_num[count]; count++) //Checks if length of string is within appropriate range
{
if(len>=10 && len<=30) //Continues to pattern get if within range
{
torf=1;
}
else //Denies continuation if string is outside of range
{
torf=0;
printf("Not within range! Try again!\n");
return (1);
}
}
printf("Please enter a pattern: "); //Initial entry statement for pattern
scanf("%s", pat_num); //User scans in pattern
printf("\n");
printf("%s\n", pat_num);
len = stringPattern(patptr); //Check how long pattern is
valid = isValid(pat_num); //Checks if pattern is valid
for(patcount=0; patcount<pat_num[patcount]; patcount++) //Checks if length of pattern is within appropriate range
{
if(len>=2 && len<=5) //Continues to pattern check if within range
{
torf=1;
}
else //Denies continuation if pattern is outside of range
{
torf=0;
printf("Pattern not within range! Try again!\n");
return (1);
}
}
checkPattern();
}
I don't know how I should start my check function. Not to mention I have to pass by reference with pointers and I'm stuck with that too
Since you have asked for the pattern matching function, I did not check your string input function. You may use this simple driver code to test my solution:
#include <stdio.h>
void findPattern(char* input, char* pattern);
int main()
{
char input[31], pattern[6];
printf("Enter the string: ");
scanf("%s", input);
printf("Enter the pattern: ");
scanf("%s", pattern);
findPattern(input, pattern);
return 0;
}
I prefer the name findPattern over checkPattern; rename it as you see fit. As per your requirement, I have not used any library functions apart from those in stdio.h. The following is my take on this task, with the logic explained in the comments. Basically, it iterates over the entire input string once, checking for a match with the initial character of the pattern. If it finds one, it marks the offset and searches further down the pattern to find a complete match.
void findPattern(char* input, char* pattern)
{
int i = 0; // iterator for input
int j = 0; // iterator for pattern
// solution variables
int offset = 0;
int occurrence = 0;
// Search the entire input string
while (input[i] != '\0')
{
// Mark the offset whenever the first character of the pattern matches
if (input[i] == pattern[j])
{
offset = i;
// I didn't quite get the relativity of your offset
// Maybe you need: offset = i + 1;
}
// Search for complete pattern match
while (input[i] != '\0' && pattern[j] == input[i])
{
// Go for the next character in the pattern
++j;
// The pattern matched successfully if the entire pattern was searched
if (pattern[j] == '\0')
{
// Display the offset
printf("\nPattern found at offset %d", offset);
// Increment the occurrence
++occurrence;
// There are no more characters left in the pattern
break;
}
else
{
// Go for the next character in the input
// only if there are more characters left to be searched in the pattern
++i;
}
}
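// If a partial match failed, rewind to the marked offset so that the
// characters consumed by the failed attempt are not skipped
// (otherwise "ab" would be missed in "aab")
if (j != 0 && pattern[j] != '\0')
{
i = offset;
}
// Reset the pattern iterator to search for a new match
j = 0;
// Increment the input iterator to search further down the string
++i;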
}
// Display the occurrence of the pattern in the input string
printf("\nThe pattern has occurred %d times in the given string", occurrence);
}
I have to pass by reference with pointers and I'm stuck with that too
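If that's the case, note that the call findPattern(input, pattern); already does it. An array name decays to a pointer to its first element when passed to a function, so findPattern receives char* pointers to the caller's buffers and no address-of operator is needed. Writing
findPattern(&input, &pattern);
would instead pass pointers of type char (*)[31] and char (*)[6], which do not match the char* parameters and typically draw a compiler warning.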
You may be overthinking the solution. You have a string, input, in which you want to count the occurrences of a multi-character pattern. One nice thing about strings is that you do not need to know how long they are to iterate over them, because by definition a string in C ends with the nul-terminating character.
This allows you to simply keep an index within your findpattern function and you increment the index each time the character from input matches the character in pattern (otherwise you zero the index). If you reach the point where pattern[index] == '\0' you have matched all characters in your pattern.
Whenever the rest of your code depends on the outcome of a function, declare it with a return type that gives the caller a meaningful indication of success or failure (if the function just prints output, then void is fine).
Here you need a sane return type to indicate whether, and how many, matches of pattern were found in input. A simple int will do; it caps the count at INT_MAX (2147483647 where int is 32 bits), which should be more than adequate.
Putting those pieces together, you could simplify your function to something similar to:
int findpattern (const char *input, const char *ptrn)
{
int n = 0, idx = 0; /* match count and pattern index */
while (*input) { /* loop over each char in input */
if (*input == ptrn[idx]) /* if current matches pattern char */
idx++; /* increment pattern index */
else { /* otherwise */
input -= idx; /* rewind to start of failed partial match */
idx = 0; /* zero pattern index */
}
if (!ptrn[idx]) { /* if end of pattern - match found */
n++; /* increment match count */
idx = 0; /* zero index for next match */
}
input++; /* increment pointer */
}
return n; /* return match count */
}
Here is a short example program that takes the pattern and input as the first two arguments to the program (or uses the defaults shown if one or both are not provided):
int main (int argc, char **argv) {
char *pattern = argc > 1 ? argv[1] : "my",
*input = argc > 2 ? argv[2] : "my dog has fleas, my cat has none";
int n;
if ((n = findpattern (input, pattern)))
printf ("'%s' occurs %d time(s) in '%s'\n", pattern, n, input);
else
puts ("pattern not found");
}
Note how providing a meaningful return lets you both (1) validate whether a match was found and (2) obtain the number of matches found. The complete code needs only the header stdio.h, e.g.
#include <stdio.h>
int findpattern (const char *input, const char *ptrn)
{
int n = 0, idx = 0; /* match count and pattern index */
while (*input) { /* loop over each char in input */
if (*input == ptrn[idx]) /* if current matches pattern char */
idx++; /* increment pattern index */
else { /* otherwise */
input -= idx; /* rewind to start of failed partial match */
idx = 0; /* zero pattern index */
}
if (!ptrn[idx]) { /* if end of pattern - match found */
n++; /* increment match count */
idx = 0; /* zero index for next match */
}
input++; /* increment pointer */
}
return n; /* return match count */
}
int main (int argc, char **argv) {
char *pattern = argc > 1 ? argv[1] : "my",
*input = argc > 2 ? argv[2] : "my dog has fleas, my cat has none";
int n;
if ((n = findpattern (input, pattern)))
printf ("'%s' occurs %d time(s) in '%s'\n", pattern, n, input);
else
puts ("pattern not found");
}
Example Use/Output
Check for multiple matches:
$ ./bin/findpattern
'my' occurs 2 time(s) in 'my dog has fleas, my cat has none'
A single match:
$ ./bin/findpattern fleas
'fleas' occurs 1 time(s) in 'my dog has fleas, my cat has none'
Pattern not found:
$ ./bin/findpattern gophers
pattern not found
All the same pattern:
$ ./bin/findpattern my "mymymy"
'my' occurs 3 time(s) in 'mymymy'
Output From Function Itself
While it would be better to provide a return to indicate the number of matches (which would allow the function to be reused in a number of different ways), if you did just want to make this an output function that outputs the results each time it is called, then simply move the output into the function and declare another pointer to input so input is preserved for printing at the end.
The changes are minimal, e.g.
#include <stdio.h>
void findpattern (const char *input, const char *ptrn)
{
const char *p = input; /* pointer to input */
int n = 0, idx = 0; /* match count and pattern index */
while (*p) { /* loop over each char in input */
if (*p == ptrn[idx]) /* if current matches pattern char */
idx++; /* increment pattern index */
else { /* otherwise */
p -= idx; /* rewind to start of failed partial match */
idx = 0; /* zero pattern index */
}
if (!ptrn[idx]) { /* if end of pattern - match found */
n++; /* increment match count */
idx = 0; /* zero index for next match */
}
p++; /* increment pointer */
}
if (n) /* output results */
printf ("'%s' occurs %d time(s) in '%s'\n", ptrn, n, input);
else
puts ("pattern not found");
}
int main (int argc, char **argv) {
char *pattern = argc > 1 ? argv[1] : "my",
*input = argc > 2 ? argv[2] : "my dog has fleas, my cat has none";
findpattern (input, pattern);
}
(use and output are the same as above)
Look things over and let me know if you have further questions.

Why does regexec() in POSIX C always return only the first match, and how can it return all match positions in a single run?

I want to return all match positions in a string such as:
abcd123abcd123abcd
Suppose I want to find every "abcd". I must call regexec() to get the first position (0, 3), then use:
123abcd123abcd
as the new string and call regexec() again, and so on.
I read the manual about regexec(), it says:
int regexec(const regex_t *preg, const char *string, size_t nmatch,
regmatch_t pmatch[], int eflags);
nmatch and pmatch are used to provide information regarding the location of any
matches.
but why doesn't this work?
This is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>
int main(int argc, char **argv)
{
int i = 0;
int res;
int len;
char result[BUFSIZ];
char err_buf[BUFSIZ];
char* src = argv[1];
const char* pattern = "\\<[^,;]+\\>";
regex_t preg;
regmatch_t pmatch[10];
if( (res = regcomp(&preg, pattern, REG_EXTENDED)) != 0)
{
regerror(res, &preg, err_buf, BUFSIZ);
printf("regcomp: %s\n", err_buf);
exit(res);
}
res = regexec(&preg, src, 10, pmatch, REG_NOTBOL);
//~ res = regexec(&preg, src, 10, pmatch, 0);
//~ res = regexec(&preg, src, 10, pmatch, REG_NOTEOL);
if(res == REG_NOMATCH)
{
printf("NO match\n");
exit(0);
}
for (i = 0; pmatch[i].rm_so != -1; i++)
{
len = pmatch[i].rm_eo - pmatch[i].rm_so;
memcpy(result, src + pmatch[i].rm_so, len);
result[len] = 0;
printf("num %d: '%s'\n", i, result);
}
regfree(&preg);
return 0;
}
./regex 'hello, world'
the output:
num 0: 'hello'
This is my expected output:
num 0: 'hello'
num 1: 'world'
regexec performs a regex match. Once a match has been found regexec will return zero (i.e. successful match). The parameter pmatch will contain information about that one match. The first array index (i.e. zero) will contain the entire match, subsequent array indices contain information about capture groups/sub-expressions.
To demonstrate:
const char* pattern = "(\\w+) (\\w+)";
matched on "hello world" will output:
num 0: 'hello world' - entire match
num 1: 'hello' - capture group 1
num 2: 'world' - capture group 2
In most regex environments you could get the behaviour you want with the global modifier, /g. regexec() provides no such flag and does not support modifiers. You will therefore have to call regexec() in a loop for as long as it returns zero, starting each search at the end of the previous match (its rm_eo offset), to get all matches.
The global modifier is not available in the PCRE library (a well-known C regex library) either. The PCRE man pages have this to say about it:
By calling pcre_exec() multiple times with appropriate arguments, you
can mimic Perl's /g option

Resources