I have 26 values that I treat as special symbols, marked with the delimiter "$"; the symbols run from $A to $Z.
I also have a predefined template:
I have $A,$B,$C.....
The user can enter a string that contains special symbols together with their values, for example:
Input - $ACar $BBike $CTruck.
Then my output should be: *I have Car,Bike,Truck...*
because every special symbol has been replaced by its value.
Notes:
1. If the input is $A Car $A Bike, then $A should take the value Car and the rest should be discarded.
2. If the input string doesn't contain any special symbol, the output should be unchanged:
I have $A,$B,$C.....
3. If the input starts with other text, as in "i am a men $A glass", then everything before $A should be discarded.
Which approach should I follow to make this possible?
I am thinking of running strstr on the input string to find each special symbol, storing the positions in a list, and then extracting the values according to those positions, but I don't think this will work for me.
Processing is simplified by using a dynamic string, like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
typedef struct dstr {
size_t size;
size_t capacity;
char *str;
} Dstr;//dynamic string
Dstr *dstr_make(void){
Dstr *s;
s = (Dstr*)malloc(sizeof(Dstr));
s->size = 0;
s->capacity=16;
s->str=(char*)realloc(NULL, sizeof(char)*(s->capacity += 16));
return s;
}
void dstr_addchar(Dstr *ds, const char ch){
ds->str[ds->size] = ch;
if(++ds->size == ds->capacity)
ds->str=(char*)realloc(ds->str, sizeof(char)*(ds->capacity += 16));
}
void dstr_addstr(Dstr *ds, const char *s){
while(*s) dstr_addchar(ds, *s++);
//dstr_addchar(ds, '\0');
}
void dstr_free(Dstr *ds){
free(ds->str);
free(ds);
}
void dic_entry(char *dic[26], const char *source){
char *p, *backup, ch;
p = backup = strdup(source);
for(;NULL!=(p=strtok(p, " \t\n"));p=NULL){
if(*p == '$' && isupper(ch=*(p+1))){
if(dic[ch -'A'] == NULL)
dic[ch -'A'] = strdup(p+2);
}
}
free(backup);
}
void dic_clear(char *dic[26]){
int i;
for(i=0;i<26;++i){
if(dic[i]){
free(dic[i]);
dic[i] = NULL;
}
}
}
int main(void){
const char *template = "I have $A,$B,$C.";
char *dic[26] = { 0 };
char buff[1024];
const char *cp;
Dstr *ds = dstr_make();
printf("input special value setting: ");
fgets(buff, sizeof(buff), stdin);
dic_entry(dic, buff);
for(cp=template;*cp;++cp){
if(*cp == '$'){
char ch;
if(isupper(ch=*(cp+1)) && dic[ch - 'A']!=NULL){
dstr_addstr(ds, dic[ch - 'A']);
++cp;
} else {
dstr_addchar(ds, *cp);
}
} else {
dstr_addchar(ds, *cp);
}
}
dstr_addchar(ds, '\0');
printf("result:%s\n", ds->str);
dic_clear(dic);
dstr_free(ds);
return 0;
}
/* DEMO
>a
input special value setting: $ACar $BBike $CTruck
result:I have Car,Bike,Truck.
>a
input special value setting: $BBike
result:I have $A,Bike,$C.
*/
What you're describing is called a Macro Processor or Macro Expander.
You can store your symbol table in an array indexed by the input char.
char *symtab[256] = {0};
Since the symbol names are single-letters, you can use strchr to find the first '$' and check if the next char is a letter (isupper()).
For the actual replacement, it will require some delicate memory management unless you just use really big buffers and make sure to only feed it small data.
If symtab['A'] holds "Car", you can locate the symbol with loc = strstr(line, "$A"). Then loc-line is the length of the prefix part, 2 is the length of the symbol name being deleted (so it is not counted in the new size), strlen("Car") is the length of the replacement, and strlen(loc+2) is the length of the suffix part. So the new string size should be
char *result = malloc( (loc-line) + strlen(symtab['A']) + strlen(loc+2) + 1);
Then patching up the new string is
memcpy(result, line, loc-line);
strcpy(result + (loc-line), symtab['A']);
strcpy(result + (loc-line) + strlen(symtab['A']), loc+2);
Notice these are copies to explicit offsets, not strcat, which appends strings together. Only the prefix is copied from line (copying the whole line could overrun result when the replacement is shorter than the symbol name); each strcpy then writes directly after the part just copied, and the last one supplies the terminating null.
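Putting the size calculation and the copies together, a minimal sketch might look like the following. The function name expand_once and its parameters are made up for illustration and are not part of the answer above; it handles only the first occurrence of one symbol and leaves the symbol-table lookup to the caller.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of line in which the
   first occurrence of sym (e.g. "$A") is replaced by value.
   Returns NULL if sym is not found or if allocation fails.
   The caller is responsible for calling free() on the result. */
char *expand_once(const char *line, const char *sym, const char *value)
{
    const char *loc = strstr(line, sym);
    if (loc == NULL)
        return NULL;

    size_t prefix = (size_t)(loc - line);   /* text before the symbol   */
    size_t symlen = strlen(sym);            /* 2 for a symbol like "$A" */
    size_t vallen = strlen(value);          /* replacement length       */
    size_t suffix = strlen(loc + symlen);   /* text after the symbol    */

    char *result = malloc(prefix + vallen + suffix + 1);
    if (result == NULL)
        return NULL;

    memcpy(result, line, prefix);                               /* prefix          */
    memcpy(result + prefix, value, vallen);                     /* replacement     */
    memcpy(result + prefix + vallen, loc + symlen, suffix + 1); /* suffix and '\0' */
    return result;
}
```

For example, expand_once("I have $A,$B.", "$A", "Car") yields "I have Car,$B.". Calling it once per defined symbol would expand a whole template, at the cost of one allocation per symbol.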
So I've seen a lot of functions like str_replace(str, substr, newstring), but none of them work with numbers, so I was wondering if anyone has one that works with both chars and ints, or just ints. I've been looking everywhere and can't figure out how to write my own.
My goal, exactly, is to be able to replace a substring with an int value, not just a string with a string.
Below is the function I use to replace strings; it works just fine:
void strrpc(char *target, const char *needle, const char *replacement)
{
char buffer[1024] = { 0 };
char *insert_point = &buffer[0];
const char *tmp = target;
size_t needle_len = strlen(needle);
size_t repl_len = strlen(replacement);
while (1) {
const char *p = strstr(tmp, needle);
// walked past last occurrence of needle; copy remaining part
if (p == NULL) {
strcpy(insert_point, tmp);
break;
}
// copy part before needle
memcpy(insert_point, tmp, p - tmp);
insert_point += p - tmp;
// copy replacement string
memcpy(insert_point, replacement, repl_len);
insert_point += repl_len;
// adjust pointers, move on
tmp = p + needle_len;
}
// write altered string back to target
strcpy(target, buffer);
}
You can turn an integer into a string by "printing" it to a string:
int id = get_id();
char idstr[20];
sprintf(idstr, "%d", id);
Now you can
char msg[1024] = "Processing item {id} ...";
strrpc(msg, "{id}", idstr);
puts(msg);
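But note that the implementation of strrpc you found will work only if the string after replacement is at most 1023 characters long, because it is assembled in a fixed 1024-byte buffer (and target must be large enough to hold the result as well). Also note that the example above could more easily be written as just: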
printf("Processing item %d ...\n", get_id());
without the danger of buffer overflow. I don't know what exactly you want to achieve, but perhaps string replacement is not the best solution here. (Just sayin'.)
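If a replacement function is still wanted, the size limit can at least be checked instead of assumed. The sketch below is one possible way to do that, not the strrpc from the question: the name replace_with_int and its parameters are invented here, it writes into a caller-supplied buffer, and it replaces only the first occurrence of the needle.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes into out (capacity outsize) a copy of src in
   which the first occurrence of needle is replaced by the decimal text of
   value. Returns 0 on success, -1 if needle is absent or out is too small. */
int replace_with_int(char *out, size_t outsize,
                     const char *src, const char *needle, int value)
{
    const char *p = strstr(src, needle);
    if (p == NULL)
        return -1;

    /* "%.*s" prints at most (p - src) characters of src, i.e. the part
       before needle; then the number; then everything after needle. */
    int n = snprintf(out, outsize, "%.*s%d%s",
                     (int)(p - src), src, value, p + strlen(needle));

    /* snprintf returns the length it needed; anything >= outsize was truncated */
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    return 0;
}
```

With char msg[64], the call replace_with_int(msg, sizeof msg, "Processing item {id} ...", "{id}", 42) leaves "Processing item 42 ..." in msg; with a buffer that is too small it reports failure rather than overflowing.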
I want to replace _ (underscore) with a space and make the first letter of the name and the surname upper case while printing the nameList in the searchKeyword method.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void searchKeyword(const char * nameList[], int n, const char keyword[])
{
int i,name=0;
char *str;
const char s[2] = " " ;
for(i=0;i<n;i++)
{
char *str = (char *) malloc((strlen(nameList[0])+1)*sizeof(char));
strcpy(str,nameList[i]);
strtok(str,"_");
if(strcmp(keyword,strtok(NULL,"_"))==0) // argument NULL will start string
{ // from last point of previous string
name++;
if(nameList[i] == '_')
strcpy(nameList[i],s);
//nameList[i] = ' ';
printf("%s\n",nameList[i]);
}
}
if(name==0)
{
printf("No such keyword found\n");
}
free(str); //deallocating space
}
int main()
{
char p1[] = "zoe_bale";
char p2[] = "sam_rodriguez";
char p3[] = "jack_alonso";
char p4[] = "david_studi";
char p5[] = "denzel_feldman";
char p6[] = "james_bale";
char p7[] = "james_willis";
char p8[] = "michael_james";
char p9[] = "dustin_bale";
const char * nameList[9] = {p1, p2, p3, p4, p5, p6, p7, p8, p9};
char keyword[100];
printf("Enter a keyword: ");
scanf("%s", keyword);
printf("\n");
searchKeyword(nameList, 9, keyword);
printf("\n");
for (int i = 0; i < 9; i++)
printf("%s\n",nameList[i]);
return 0;
}
Search through the strings and print the ones whose surname part is equal to keyword.
As shown in the example runs below, the strings are printed in “Name Surname” format (the first letters are capitalized).
Output should be like this:
Enter a keyword: james
Michael James
zoe_bale
sam_rodriguez
jack_alonso
david_studi
denzel_feldman
james_bale
james_willis
michael_james
dustin_bale
There is no reason to dynamically allocate storage for your name and surname. Looking at your input, neither will exceed 9 characters, so simply using a 64-char array for each provides roughly 7X the storage required (if you are unsure, double that to 128 chars). That avoids the comparatively expensive calls to malloc.
To check whether keyword exists in nameList[i], you don't need to separate the values first and then compare. Simply use strstr (nameList[i], keyword) to determine if keyword is contained in nameList[i]. If you then want to match only the name or surname you can compare again after they are separated. (up to you)
To parse the names from the nameList[i] string, all you need is a single pointer to locate the '_' character. A simple call to strchr() will do and it does not modify nameList[i] so there is no need to duplicate.
After using strchr() to locate the '_' character, simply memcpy() from the start of nameList[i] to your pointer to your name array, increment the pointer and then strcpy() from p to surname. Now you have separated name and surname, simply call toupper() on the first character of each and then output the names separate by a space, e.g.
...
#include <ctype.h>
#define NLEN 64
void searchKeyword (const char *nameList[], int n, const char keyword[])
{
for (int i = 0; i < n; i++) { /* loop over each name in list */
if (strstr (nameList[i], keyword)) { /* does name contain keyword? */
char name[NLEN], surname[NLEN]; /* storage for name, surname */
const char *p = nameList[i]; /* pointer to parse nameList[i] */
if ((p = strchr(p, '_'))) { /* find '_' in nameList[i] */
/* copy first-name to name */
memcpy (name, nameList[i], p - nameList[i]);
name[p++ - nameList[i]] = 0; /* nul-terminate first name */
*name = toupper (*name); /* convert 1st char to upper */
/* copy last name to surname */
strcpy (surname, p);
*surname = toupper (*surname); /* convert 1st char to upper */
printf ("%s %s\n", name, surname); /* output "Name Surname" */
}
}
}
}
Example Use/Output
Used with the remainder of your code, searching for "james" locates those names containing "james" and provides what looks like the output you requested, e.g.
$ ./bin/keyword_surname
Enter a keyword: james
James Bale
James Willis
Michael James
zoe_bale
sam_rodriguez
jack_alonso
david_studi
denzel_feldman
james_bale
james_willis
michael_james
dustin_bale
(note: to match only the name or surname add an additional strcmp before the call to printf to determine which you want to output)
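As a rough illustration of that note, the sketch below matches on the surname only, which is what the expected output in the question shows (only "Michael James" for the keyword "james"). The helper name format_if_surname and its output-buffer interface are assumptions made for this example.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: if the surname part of entry ("name_surname") equals
   keyword, writes "Name Surname" into out (capacity outsize) and returns 1.
   Returns 0 if there is no '_', no match, or not enough room in out. */
int format_if_surname(const char *entry, const char *keyword,
                      char *out, size_t outsize)
{
    const char *p = strchr(entry, '_');          /* locate the separator */
    if (p == NULL)
        return 0;

    size_t nlen = (size_t)(p - entry);           /* length of first name */
    const char *surname = p + 1;

    if (strcmp(surname, keyword) != 0)           /* whole-surname match only */
        return 0;
    if (nlen == 0 || *surname == '\0' ||
        nlen + 1 + strlen(surname) + 1 > outsize)
        return 0;

    memcpy(out, entry, nlen);                    /* first name */
    out[nlen] = ' ';                             /* '_' becomes a space */
    strcpy(out + nlen + 1, surname);             /* surname and '\0' */

    out[0] = (char)toupper((unsigned char)out[0]);
    out[nlen + 1] = (char)toupper((unsigned char)out[nlen + 1]);
    return 1;
}
```

Inside the loop over nameList, printing out whenever the helper returns 1 would give the "Name Surname" lines while leaving nameList[i] itself untouched.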
Notes On Your Existing Code
Additional notes continuing from the comments on your existing code,
char *str = (char *) malloc((strlen(nameList[0])+1)*sizeof(char));
should simply be
str = malloc (strlen (nameList[i]) + 1);
You have previously declared char *str; so the declaration before your call to malloc() shadows your previous declaration. If you are using gcc/clang, you can add -Wshadow to your compile string to ensure you are warned of shadowed variables. (they can have dire consequences in other circumstances)
Next, sizeof (char) is always 1 and should be omitted from your size calculation. There is no need to cast the return of malloc() in C. See: Do I cast the result of malloc?
Your comparison if (nameList[i] == '_') is a comparison between a pointer and integer and will not work. Your compiler should be issuing a diagnostic telling you that is incorrect (do not ignore compiler warnings -- do not accept code until it compiles without warning)
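What that comparison was presumably meant to do is test individual characters of the string. A minimal sketch, assuming the string is a writable copy (the name underscores_to_spaces is made up for this example):

```c
/* Hypothetical helper: replaces every '_' in the writable string s with a
   space and returns s. It must not be called on a string literal. */
char *underscores_to_spaces(char *s)
{
    for (char *c = s; *c != '\0'; c++) {
        if (*c == '_')      /* compares a character, not the pointer */
            *c = ' ';
    }
    return s;
}
```

Since nameList is declared as an array of const char *, the function would be applied to a local copy of nameList[i] rather than to the entry itself.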
Look things over and let me know if you have further questions.
This worked for me and has no memory leaks:
void searchKeyword(const char * nameList[], int n, const char keyword[])
{
int found = 0;
const char delim = '_';
for (int i = 0; i < n; i++) {
const char *fst = nameList[i];
for (const char *tmp = fst; *tmp != '\0'; tmp++) {
if (*tmp == delim) {
const char *snd = tmp + 1;
int fst_length = (int)(tmp - fst); /* length of the first name */
/* compare whole words so that a mere common prefix does not match */
if ((strlen(keyword) == (size_t)fst_length &&
strncmp(fst, keyword, fst_length) == 0) ||
strcmp(snd, keyword) == 0) {
found = 1;
printf("%c%.*s %c%s\n",
fst[0]-32, fst_length-1, fst+1,
snd[0]-32, snd+1);
}
break;
}
}
}
if (!found)
puts("No such keyword found");
}
Hopefully it's fine for you too, although I use string.h functions very rarely.
I made a program in C that can find two similar or different strings and extract the string between them. This type of program has so many uses, and generally when you use such a program, you have a lot of info, so it needs to be fast. I would like tips on how to make this program as fast and efficient as possible.
I am looking for suggestions that won't make me resort to heavy libraries (such as regex).
The code must:
be able to extract a string between two similar or different strings
find the 1st occurrence of string1
find the 1st occurrence of string2 which occurs AFTER string1
extract the string between string1 and string2
be able to use string arguments of any size
be foolproof to human error and return NULL if such occurs (example, string1 exceeds entire text string length. don't crash in an element error, but gracefully return NULL)
focus on speed and efficiency
Below is my code. I am quite new to C, coming from C++, so I could probably use a few suggestions, especially regarding efficient/proper use of the 'malloc' command:
fast_strbetween.c:
/*
Compile with:
gcc -Wall -O3 fast_strbetween.c -o fast_strbetween
*/
#include <stdio.h> // printf
#include <stdlib.h> // malloc
// inline function if it pleases the compiler gods
inline size_t fast_strlen(char *str)
{
int i; // Cannot return 'i' if inside for loop
for(i = 0; str[i] != '\0'; ++i);
return i;
}
char *fast_strbetween(char *str, char *str1, char *str2)
{
// size_t segfaults when incorrect length strings are entered (due to going below 0), so use int instead for increased robustness
int str0len = fast_strlen(str);
int str1len = fast_strlen(str1);
int str1pos = 0;
int charsfound = 0;
// Find str1
do {
charsfound = 0;
while (str1[charsfound] == str[str1pos + charsfound])
++charsfound;
} while (++str1pos < str0len - str1len && charsfound < str1len);
// '++str1pos' increments past by 1: needs to be set back by one
--str1pos;
// Whole string not found or logical impossibilty
if (charsfound < str1len)
return NULL;
/* Start searching 2 characters after last character found in str1. This will ensure that there will be space, and logical possibility, for the extracted text to exist or not, and allow immediate bail if the latter case; str1 cannot possibly have anything between it if str2 is right next to it!
Example:
str = 'aa'
str1 = 'a'
str2 = 'a'
returned = '' (should be NULL)
Without such preventative, str1 and str2 would would be found and '' would be returned, not NULL. This also saves 1 do/while loop, one check pertaining to returning null, and two additional calculations:
Example, if you didn't add +1 str2pos, you would need to change the code to:
if (charsfound < str2len || str2pos - str1pos - str1len < 1)
return NULL;
It also allows for text to be found between three similar strings—what??? I can feel my brain going fuzzy!
Let this example explain:
str = 'aaa'
str1 = 'a'
str2 = 'a'
result = '' (should be 'a')
Without the aforementioned preventative, the returned string is '', not 'a'; the program takes the first 'a' for str1 and the second 'a' for str2, and tries to return what is between them (nothing).
*/
int str2pos = str1pos + str1len + 1; // the '1' added to str2pos
int str2len = fast_strlen(str2);
// Find str2
do {
charsfound = 0;
while (str2[charsfound] == str[str2pos + charsfound])
++charsfound;
} while (++str2pos < str0len - str2len + 1 && charsfound < str2len);
// Deincrement due to '++str2pos' over-increment
--str2pos;
if (charsfound < str2len)
return NULL;
// Only allocate what is needed
char *strbetween = (char *)malloc(sizeof(char) * str2pos - str1pos - str1len);
unsigned int tmp = 0;
for (unsigned int i = str1pos + str1len; i < str2pos; i++)
strbetween[tmp++] = str[i];
return strbetween;
}
int main() {
char str[30] = { "abaabbbaaaabbabbbaaabbb" };
char str1[10] = { "aaa" };
char str2[10] = { "bbb" };
//Result should be: 'abba'
printf("The string between is: \'%s\'\n", fast_strbetween(str, str1, str2));
// free malloc as we go
for (int i = 10000000; --i;)
free(fast_strbetween(str, str1, str2));
return 0;
}
In order to have some way of measuring progress, I have already timed the code above (extracting a small string 10000000 times):
$ time fast_strbetween
The string between is: 'abba'
0m11.09s real 0m11.09s user 0m00.00s system
Process used 99.3 - 100% CPU according to 'top' command (Linux).
Memory used while running: 3.7Mb
Executable size: 8336 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
If anyone would like to offer code, tips, pointers... I would appreciate it. I will also implement the changes and give a timed result for your troubles.
Oh, and one thing that I learned is to always de-allocate malloc; I ran the code above (with extra loops), just before posting this. My computer's ram filled up, and the computer froze. Luckily, Stack made a backup draft! Lesson learned!
* EDIT *
Here is the revised code, using chqrlie's advice as best I could. I added extra checks for end of string, which ended up costing about a second of time with the tested phrase, but the code can now bail very fast if the first string is not found. Using null or illogical strings should not result in an error, hopefully. There are lots of notes in the code, where they can be better understood. If I've left anything out or done something incorrectly, please let me know, guys; it is not intentional.
fast_strbetween2.c:
/*
Compile with:
gcc -Wall -O3 fast_strbetween2.c -o fast_strbetween2
Corrections and additions courtesy of:
https://stackoverflow.com/questions/55308295/extracting-a-string-between-two-similar-or-different-strings-in-c-as-fast-as-p
*/
#include<stdio.h> // printf
#include<stdlib.h> // malloc, free
// Strings now set to 'const'
char * fast_strbetween(const char *str, const char *str1, const char *str2)
{
// string size will now be calculated by the characters picked up
size_t str1pos = 0;
size_t str1chars;
// Find str1
do{
str1chars = 0;
// Will the do/while str1 check for '\0' suffice?
// I haven't seen any issues yet, but not sure.
while(str1[str1chars] == str[str1pos + str1chars] && str1[str1chars] != '\0')
{
//printf("Found str1 char: %i num: %i pos: %i\n", str1[str1chars], str1chars + 1, str1pos);
++str1chars;
}
// Incrementing whilst not in conditional expression tested faster
++str1pos;
/* There are two checks for "str1[str1chars] != '\0'". Trying to find
another efficient way to do it in one. */
}while(str[str1pos] != '\0' && str1[str1chars] != '\0');
--str1pos;
//For testing:
//printf("str1pos: %i str1chars: %i\n", str1pos, str1chars);
// exit if no chars were found or if didn't reach end of str1
if(!str1chars || str1[str1chars] != '\0')
{
//printf("Bailing from str1 result\n");
return '\0';
}
/* Got rid of the '+1' code which didn't allow for '' returns.
I agree with your logic of <tag></tag> returning ''. */
size_t str2pos = str1pos + str1chars;
size_t str2chars;
//printf("Starting pos for str2: %i\n", str1pos + str1chars);
// Find str2
do{
str2chars = 0;
while(str2[str2chars] == str[str2pos + str2chars] && str2[str2chars] != '\0')
{
//printf("Found str2 char: %i num: %i pos: %i \n", str2[str2chars], str2chars + 1, str2pos);
++str2chars;
}
++str2pos;
}while(str[str2pos] != '\0' && str2[str2chars] != '\0');
--str2pos;
//For testing:
//printf("str2pos: %i str2chars: %i\n", str2pos, str2chars);
if(!str2chars || str2[str2chars] != '\0')
{
//printf("Bailing from str2 result!\n");
return '\0';
}
/* Trying to allocate strbetween with malloc. Is this correct? */
char * strbetween = malloc(2);
// Check if malloc succeeded:
if (strbetween == '\0') return '\0';
size_t tmp = 0;
// Grab and store the string between!
for(size_t i = str1pos + str1chars; i < str2pos; ++i)
{
strbetween[tmp] = str[i];
++tmp;
}
return strbetween;
}
int main() {
char str[30] = { "abaabbbaaaabbabbbaaabbb" };
char str1[10] = { "aaa" };
char str2[10] = { "bbb" };
printf("Searching \'%s\' for \'%s\' and \'%s\'\n", str, str1, str2);
printf(" 0123456789\n\n"); // Easily see the elements
printf("The word between is: \'%s\'\n", fast_strbetween(str, str1, str2));
for(int i = 10000000; --i;)
free(fast_strbetween(str, str1, str2));
return 0;
}
** Results **
$ time fast_strbetween2
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
0123456789
The word between is: 'abba'
0m10.93s real 0m10.93s user 0m00.00s system
Process used 99.0 - 100% CPU according to 'top' command (Linux).
Memory used while running: 1.8Mb
Executable size: 8336 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
chqrlie's answer
I understand that this is just some example code that shows proper programming practices. Nonetheless, it can make for a decent control in testing.
Please note that I do not know how to deallocate malloc in your code, so it is NOT a fair test. As a result, ram usage builds up, taking 130Mb+ for the process alone. I was still able to run the test for the full 10000000 loops. I will say that I tried deallocating this code the way I did my code (via bringing the function 'simple_strbetween' down into main and deallocating with 'free(strndup(p, q - p));'), and the results weren't much different from not deallocating.
** simple_strbetween.c **
/*
Compile with:
gcc -Wall -O3 simple_strbetween.c -o simple_strbetween
Courtesy of:
https://stackoverflow.com/questions/55308295/extracting-a-string-between-two-similar-or-different-strings-in-c-as-fast-as-p
*/
#include<string.h>
#include<stdio.h>
char *simple_strbetween(const char *str, const char *str1, const char *str2) {
const char *q;
const char *p = strstr(str, str1);
if (p) {
p += strlen(str1);
q = *str2 ? strstr(p, str2) : p + strlen(p);
if (q)
return strndup(p, q - p);
}
return NULL;
}
int main() {
char str[30] = { "abaabbbaaaabbabbbaaabbb" };
char str1[10] = { "aaa" };
char str2[10] = { "bbb" };
printf("Searching \'%s\' for \'%s\' and \'%s\'\n", str, str1, str2);
printf(" 0123456789\n\n"); // Easily see the elements
printf("The word between is: \'%s\'\n", simple_strbetween(str, str1, str2));
for(int i = 10000000; --i;)
simple_strbetween(str, str1, str2);
return 0;
}
$ time simple_strbetween
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
0123456789
The word between is: 'abba'
0m19.68s real 0m19.34s user 0m00.32s system
Process used 100% CPU according to 'top' command (Linux).
Memory used while running: 130Mb (leak due to my lack of knowledge)
Executable size: 8380 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
Results for above code ran with this alternate strndup:
char *alt_strndup(const char *s, size_t n)
{
size_t i;
char *p;
for (i = 0; i < n && s[i] != '\0'; i++)
continue;
p = malloc(i + 1);
if (p != NULL) {
memcpy(p, s, i);
p[i] = '\0';
}
return p;
}
$ time simple_strbetween
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
0123456789
The word between is: 'abba'
0m20.99s real 0m20.54s user 0m00.44s system
I kindly ask that nobody make judgements on the results until the code is properly run. I will revise the results as soon as it is figured out.
* Edit *
Was able to decrease the time by over 25% (11.93s vs 8.7s). This was done by using pointers to increment the positions, as opposed to size_t. Collecting the return string whilst checking the last string was likely what caused the biggest change. I feel there is still lots of room for improvement. A big loss comes from having to free malloc. If there is a better way, I'd like to know.
fast_strbetween3.c:
/*
gcc -Wall -O3 fast_strbetween.c -o fast_strbetween
*/
#include<stdio.h> // printf
#include<stdlib.h> // malloc, free
char * fast_strbetween(const char *str, const char *str1, const char *str2)
{
const char *sbegin = &str1[0]; // String beginning
const char *spos;
// Find str1
do{
spos = str;
str1 = sbegin;
while(*spos == *str1 && *str1)
{
++spos;
++str1;
}
++str;
}while(*str1 && *spos);
// Nothing found if spos hasn't advanced
if (spos == str)
return NULL;
char *strbetween = malloc(1);
if (!strbetween)
return '\0';
str = spos;
int i = 0;
//char *p = &strbetween[0]; // Alt. for advancing strbetween (slower)
sbegin = &str2[0]; // Recycle sbegin
// Find str2
do{
str2 = sbegin;
spos = str;
while(*spos == *str2 && *str2)
{
++str2;
++spos;
}
//*p = *str;
//++p;
strbetween[i] = *str;
++str;
++i;
}while(*str2 && *spos);
if (spos == str)
return NULL;
//*--p = '\0';
strbetween[i - 1] = '\0';
return strbetween;
}
int main() {
char s[100] = "abaabbbaaaabbabbbaaabbb";
char s1[100] = "aaa";
char s2[100] = "bbb";
printf("\nString: \'%s\'\n", fast_strbetween(s, s1, s2));
for(int i = 10000000; --i; )
free(fast_strbetween(s, s1, s2));
return 0;
}
String: 'abba'
0m08.70s real 0m08.67s user 0m00.01s system
Process used 99.0 - 100% CPU according to 'top' command (Linux).
Memory used while running: 1.8Mb
Executable size: 8336 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
* Edit *
This doesn't really count as it does not 'return' a value, and therefore is against my own rules, but it does pass a variable through, which is changed and brought back to main. It runs with 1 library and takes 3.6s. Getting rid of malloc was the key.
/*
gcc -Wall -O3 fast_strbetween.c -o fast_strbetween
*/
#include<stdio.h> // printf
unsigned int fast_strbetween(const char *str, const char *str1, const char *str2, char *strbetween)
{
const char *sbegin = &str1[0]; // String beginning
const char *spos;
// Find str1
do{
spos = str;
str1 = sbegin;
while(*spos == *str1 && *str1)
{
++spos;
++str1;
}
++str;
}while(*str1 && *spos);
// Nothing found if spos hasn't advanced
if (spos == str)
{
strbetween[0] = '\0';
return 0;
}
str = spos;
sbegin = &str2[0]; // Recycle sbegin
// Find str2
do{
str2 = sbegin;
spos = str;
while(*spos == *str2 && *str2)
{
++str2;
++spos;
}
*strbetween = *str;
++strbetween;
++str;
}while(*str2 && *spos);
if (spos == str)
{
strbetween[0] = '\0';
return 0;
}
*--strbetween = '\0';
return 1; // Successful (found text)
}
int main() {
char s[100] = "abaabbbaaaabbabbbaaabbb";
char s1[100] = "aaa";
char s2[100] = "bbb";
char sret[100];
fast_strbetween(s, s1, s2, sret);
printf("String: %s\n", sret);
for(int i = 10000000; --i; )
fast_strbetween(s, s1, s2, sret);
return 0;
}
Your code has multiple problems and is probably not as efficient as it should be:
you use types int and unsigned int for indexes into the strings. These types may be smaller than the range of size_t. You should revise your code to use size_t and avoid mixing signed and unsigned types in comparisons.
your functions' string arguments should be declared as const char * as you do not modify the strings and should be able to pass const strings without a warning.
redefining strlen is a bad idea: your version will be slower than the system's optimized, assembly coded and very likely inlined version.
computing the length of str is unnecessary and potentially costly: both str1 and str2 may appear close to the beginning of str, scanning for the end of str will be wasteful.
the while loop inside the first do / while loop is incorrect: while(str1[charsfound] == str[str1pos + charsfound]) charsfound++; may access characters beyond the end of str and str1 as the loop does not stop at the null terminator. If str1 only appears at the end of str, you have undefined behavior.
if str1 is an empty string, you will find it at the end of str instead of at the beginning.
why do you initialize str2pos as int str2pos = str1pos + str1len + 1;? If str2 immediately follows str1 inside str, an empty string should be allocated and returned. Your comment regarding this case is unreadable, you should break such long lines to fit within a typical screen width such as 80 columns. It is debatable whether strbetween("aa", "a", "a") should return "" or NULL. IMHO it should return an allocated empty string, which would be consistent with the expected behavior on strbetween("<name></name>", "<name>", "</name>") or strbetween("''", "'", "'"). Your specification preventing strbetween from returning an empty string produces a counter-intuitive border case.
the second scanning loop has the same problems as the first.
the line char *strbetween = (char *) malloc(sizeof(char) * str2pos - str1pos - str1len); has multiple problems: no cast is necessary in C, if you insist on specifying the element size sizeof(char), which is 1 by definition, you should parenthesize the number of elements, and last but not least, you must allocate one extra element for the null terminator.
You do not test if malloc() succeeded. If it returns NULL, you will have undefined behavior, whereas you should just return NULL.
the copying loop uses a mix of signed and unsigned types, causing potentially counterintuitive behavior on overflow.
you forget to set the null terminator, which is consistent with the allocation size error, but incorrect.
Before you try and optimize code, you must ensure correctness! Your code is too complicated and has multiple flaws. Optimisation is a moot point.
You should first try a very simple implementation using standard C string functions: searching a string inside another one is performed efficiently by strstr.
Here is a simple implementation using strstr and strndup(), which should be available on your system:
#include <string.h>
char *simple_strbetween(const char *str, const char *str1, const char *str2) {
const char *q;
const char *p = strstr(str, str1);
if (p) {
p += strlen(str1);
q = *str2 ? strstr(p, str2) : p + strlen(p);
if (q)
return strndup(p, q - p);
}
return NULL;
}
strndup() is defined in POSIX and is part of the Extensions to the C Library Part II: Dynamic Allocation Functions, ISO/IEC TR 24731-2:2010. If it is not available on your system, it can be redefined as:
#include <stdlib.h>
#include <string.h>
char *strndup(const char *s, size_t n) {
size_t i;
char *p;
for (i = 0; i < n && s[i] != '\0'; i++)
continue;
p = malloc(i + 1);
if (p != NULL) {
memcpy(p, s, i);
p[i] = '\0';
}
return p;
}
To ensure correctness, write a number of test cases, with border cases such as all combinations of empty strings and identical strings.
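A few such border cases could be checked with a small helper along these lines. This is only a sketch: check_between is a name invented here, and simple_strbetween is repeated with malloc and memcpy standing in for strndup() so the block does not depend on POSIX. The helper also frees each result, which is what every caller of simple_strbetween has to do.

```c
#include <stdlib.h>
#include <string.h>

/* Same logic as simple_strbetween above; malloc + memcpy stand in for
   strndup() (safe here because no null byte lies between p and q). */
char *simple_strbetween(const char *str, const char *str1, const char *str2)
{
    const char *p = strstr(str, str1);
    if (p == NULL)
        return NULL;
    p += strlen(str1);
    const char *q = *str2 ? strstr(p, str2) : p + strlen(p);
    if (q == NULL)
        return NULL;

    size_t n = (size_t)(q - p);
    char *out = malloc(n + 1);
    if (out != NULL) {
        memcpy(out, p, n);
        out[n] = '\0';
    }
    return out;
}

/* Hypothetical test helper: returns 1 if the result equals expected,
   where expected == NULL means "no match expected". Frees the result. */
int check_between(const char *str, const char *s1, const char *s2,
                  const char *expected)
{
    char *got = simple_strbetween(str, s1, s2);
    int ok = (expected == NULL) ? (got == NULL)
                                : (got != NULL && strcmp(got, expected) == 0);
    free(got);   /* free(NULL) is a no-op */
    return ok;
}
```

Cases worth including are check_between("<name></name>", "<name>", "</name>", ""), check_between("aa", "a", "a", ""), check_between("abc", "a", "x", NULL) and the all-empty check_between("", "", "", "").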
Once you have thoroughly tested your strbetween function, you can write a benchmarking framework to test performance. It is not so easy to get reliable performance figures, as you will experience if you try. Remember to configure your compiler to select the appropriate optimisations, -O3 for example.
Only then can you move to the next step: if you are really restricted from using standard C library functions, you may first recode your versions of strstr and strlen and still use the same method. Test this new version both for correctness and for performance.
The redundant parts are the computation of strlen(str1) which must have been determined by strstr when it finds a match. And the scan in strndup() which is unnecessary since no null byte is present between p and q. If you have time to waste, you can try and remove these redundancies at the expense of readability, risking non conformity. I would be surprised if you get any improvement at all on average over a wide variety of test cases. 20% would be remarkable.
I have a program that displays UTF-8 encoded strings with a size limitation (say MAX_LEN).
Whenever I get a string with a length > MAX_LEN, I want to find out where I could split it so it would be printed gracefully.
For example:
#define MAX_LEN 30U
const char big_str[] = "This string cannot be displayed on one single line: it must be splitted";
Without processing, the output will look like:
"This string cannot be displaye" // Truncated because of size limitation
"d on one single line: it must "
"be splitted"
The client would be able to choose eligible delimiters for the split, but for now I have defined a default list of delimiters:
#define DEFAULT_DELIMITERS " ;:,)]" // Delimiters to track in the string
So I am looking for an elegant and lightweight way of handling this issue without using malloc: my API should not return the sub-strings, I just want the positions of the sub-strings to display.
I already have some ideas that I will propose in an answer: any feedback (e.g. pros and cons) would be appreciated, but most of all I am interested in alternative solutions.
I just want the positions of the sub-strings to display.
So all you need is one function analysing your input returning the positions where a delimiter was found.
A possible approach using strpbrk(), assuming C99 at least:
#include <unistd.h> /* for ssize_t */
#include <string.h>
#define DELIMITERS (" ;.")
void find_delimiter_positions(
const char * input,
const char * delimiters,
ssize_t * delimiter_positions)
{
ssize_t dp_current = 0;
const char * p = input;
while (NULL != (p = strpbrk(p, delimiters)))
{
delimiter_positions[dp_current] = p - input;
++dp_current;
++p;
}
}
int main(void)
{
char input[] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[input_length];
for (size_t s = 0; s < input_length; ++s)
{
delimiter_positions[s] = -1;
}
find_delimiter_positions(input, DELIMITERS, delimiter_positions);
for (size_t s = 0; s < input_length && -1 != delimiter_positions[s]; ++s) /* stop at the array end even if every character is a delimiter */
{
/* print out positions */
}
}
Why C99? C99 introduced variable-length arrays (VLAs), which are needed here to get around the restriction on dynamic memory allocation.
If VLAs may not be used either, one needs to fall back to defining a maximum number of delimiter occurrences per string. This should be feasible, as the maximum length of the string to be parsed is given, which in turn bounds the number of delimiters per string.
For the latter case those lines from the example above
char input[] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[input_length];
could be replaced by
char input[MAX_INPUT_LEN] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[MAX_INPUT_LEN];
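With a fixed-size array, the -1 sentinel can also be avoided by passing the array capacity in and returning the number of positions found. The following is only a sketch of that idea; the name find_delimiter_positions_n and the cap parameter are hypothetical and not part of the code above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical bounded variant: stores at most cap offsets in
 * positions[] and returns how many were stored.
 */
size_t find_delimiter_positions_n(const char *input,
                                  const char *delimiters,
                                  size_t *positions,
                                  size_t cap)
{
    size_t count = 0;
    const char *p = input;

    assert(input != NULL && delimiters != NULL && positions != NULL);

    while (count < cap && (p = strpbrk(p, delimiters)) != NULL) {
        positions[count++] = (size_t)(p - input);
        ++p; /* continue the search after this delimiter */
    }
    return count;
}
```

The caller then loops from 0 to the returned count, so no initialisation pass over the array is needed.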
An approach that doesn't require additional storage is to make the wrapping function call a callback function for each substring. In the example below, the string is just printed with plain old printf, but the callback could call any other API function.
Things to note:
There is a function next that should advance a pointer to the next UTF-8 character. The encoding width of a UTF-8 character can be seen from its first byte.
The space and punctuation delimiters are treated slightly differently: spaces are appended neither to the end nor to the beginning of a line (provided there are no consecutive spaces in the text, that is). Punctuation is retained at the end of a line.
Here's an example implementation:
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define DELIMITERS " ;:,)]"
/*
* Advance to the next character. This should advance the pointer by
* one to four bytes, depending on the UTF-8 encoding. (But at the
* moment, it doesn't.)
*/
static const char *next(const char *p)
{
return p + 1;
}
typedef struct {
const char *begin;
const char *end;
} substr_t;
/*
* Wraps the text and stores the found substrings' ranges into
* the lines struct. Return the number of word-wrapped lines.
*/
int wrap(const char *text, int width, substr_t *lines, uint32_t max_num_lines)
{
const char *begin = text;
const char *split = NULL;
uint32_t num_lines = 1;
int l = 0;
while (*text) {
if (strchr(DELIMITERS, *text)) {
split = text;
if (*text != ' ') split++;
}
if (l++ == width) {
if (split == NULL) split = text;
lines[num_lines - 1].begin = begin;
lines[num_lines - 1].end = split;
//write(fileno(stdout), begin, split - begin);
begin = split;
while (*begin == ' ') begin++;
text = begin; /* resume scanning at the start of the new line */
split = NULL;
l = 0;
num_lines++;
if (num_lines > max_num_lines) {
//abort();
return -1;
}
continue; /* don't skip a character or step past the terminator */
}
text = next(text);
}
lines[num_lines - 1].begin = begin;
lines[num_lines - 1].end = text;
//write(fileno(stdout), begin, split - begin);
return num_lines;
}
int main()
{
const char *text = "I have a program that displays UTF-8 encoded strings "
"with a size limitation (say MAX_LEN). Whenever I get a string with a "
"length > MAX_LEN, I want to find out where I could split it so it "
"would be printed gracefully.";
substr_t lines[100];
const uint32_t max_num_lines = sizeof(lines) / sizeof(lines[0]);
const int num_lines = wrap(text, 48, lines, max_num_lines);
if (num_lines < 0) {
fprintf(stderr, "error: can't split into %u lines\n", (unsigned)max_num_lines);
return EXIT_FAILURE;
}
//printf("num_lines = %d\n", num_lines);
for (int i=0; i < num_lines; i++) {
FILE *stream = stdout;
const ptrdiff_t line_length = lines[i].end - lines[i].begin;
fwrite(lines[i].begin, 1, (size_t)line_length, stream); /* stay on stdio so output order is kept when buffered */
fputc('\n', stream);
}
return EXIT_SUCCESS;
}
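The next stub above can be filled in without decoding anything: in valid UTF-8, every byte after the first byte of a character is a continuation byte of the form 10xxxxxx, so it is enough to step once and then skip continuation bytes. A minimal sketch, assuming valid UTF-8 input (the name next_utf8 is made up here to keep it apart from the stub):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of a UTF-8 aware replacement for next(): returns a pointer
 * to the first byte of the following character. Assumes valid UTF-8.
 */
const char *next_utf8(const char *p)
{
    assert(p != NULL);

    if (*p == '\0') return p; /* never step past the terminator */
    p++;
    /* skip continuation bytes, which have the bit pattern 10xxxxxx */
    while (((unsigned char)*p & 0xC0) == 0x80) p++;
    return p;
}
```

Note that the wrap function counts one unit per call to next, so with this version the width is measured in characters rather than bytes; combining characters and double-width glyphs are still not accounted for.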
Addendum: Here's another approach that builds loosely on the strtok pattern, but without modifying the string. It requires a state and that state must be initialised with the string to print and the maximum line width:
struct wrap_t {
const char *src;
int width;
int length;
const char *line;
};
int wrap(struct wrap_t *line)
{
const char *begin = line->src;
const char *split = NULL;
int l = 0;
if (begin == NULL) return -1;
while (*begin == ' ') begin++;
if (*begin == '\0') return -1;
line->src = begin; /* scan from the first non-space so split never precedes begin */
while (*line->src) {
if (strchr(DELIMITERS, *line->src)) {
split = line->src;
if (*line->src != ' ') split++;
}
if (l++ == line->width) {
if (split == NULL) split = line->src;
line->line = begin;
line->length = split - begin;
line->src = split;
return 0;
}
line->src = next(line->src);
}
line->line = begin;
line->length = line->src - begin;
return 0;
}
All definitions not shown (DELIMITERS, next) are as above and the basic algorithm hasn't changed. I think this method is easy to use for the client:
int main()
{
const char *text = "I have a program that displays UTF-8 encoded strings "
"with a size limitation (say MAX_LEN). Whenever I get a string with a "
"length > MAX_LEN, I want to find out where I could split it so it "
"would be printed gracefully.";
struct wrap_t line = {text, 60};
while (wrap(&line) == 0) {
printf("%.*s\n", line.length, line.line);
}
return 0;
}
Solution1
A function that is called repeatedly until the whole string has been processed: it returns the number of bytes to copy to create each sub-string:
The API:
/**
* Return the length between the beginning of the string and the
* last delimiter (such that returned length <= max_length)
*/
size_t get_next_substring_length(
const char * str, // The string to be split
const char * delim, // String of eligible delimiters for a split
size_t max_length); // The maximum length of resulting substring
On the client side:
size_t shift = 0;
for(;;)
{
// Where do we start within big_str ?
const char * tmp = big_str + shift;
size_t count = get_next_substring_length(tmp, DEFAULT_DELIMITERS, MAX_LEN);
if(count)
{
// Allocate a sub-string and recopy "count" bytes
// Display the sub-string
shift += count;
}
else // End Of String (or error)
{
// Handle potential error
// Exit the loop
}
}
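For illustration, here is one way the function declared above could be implemented. This is only a sketch under my own assumptions: the delimiter stays at the end of the returned sub-string, a run with no delimiter is cut hard at max_length (which may fall inside a multi-byte UTF-8 character), and 0 means the end of the string was reached:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

size_t get_next_substring_length(const char *str,
                                 const char *delim,
                                 size_t max_length)
{
    size_t len = 0;

    assert(str != NULL && delim != NULL);

    /* look at max_length + 1 bytes at most */
    while (len <= max_length && str[len] != '\0') len++;
    if (len <= max_length) return len; /* the rest fits on one line */

    /* search backwards for the last delimiter in the first max_length bytes */
    for (size_t i = max_length; i > 0; --i) {
        if (strchr(delim, str[i - 1]) != NULL)
            return i; /* the delimiter is included in the sub-string */
    }
    return max_length; /* no delimiter found: hard split */
}
```

With big_str and MAX_LEN as defined in the question, successive calls yield 22, 30 and 19 bytes, then 0.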
Solution2
Define a custom structure to store positions and lengths of sub-strings:
const char * str = "This is a long test string";
struct substrings
{
const char * str; // Beginning of the substring
size_t length; // Length of the substring
} sub[] = { {&str[0], 4},
{&str[5], 2},
{&str[8], 1},
{&str[10], 4},
{&str[15], 4},
{&str[20], 6},
{NULL, 0} };
The API:
size_t find_substrings(
struct substrings * substr,
size_t max_length,
const char * delimiters,
const char * str);
On the client side:
#define ARRAY_LENGTH 20U
struct substrings substr[ARRAY_LENGTH];
// Fill the structure
find_substrings(
substr,
ARRAY_LENGTH,
DEFAULT_DELIMITERS,
big_str);
// Browse the structure
for (struct substrings * sub = &substr[0]; sub->str; sub++)
{
// Display sub->length bytes of sub->str
}
Some things bother me though:
in Solution1 I don't like the infinite loop; such loops are often bug-prone
in Solution2 I fixed ARRAY_LENGTH arbitrarily, but it should vary depending on the input string length
I need to extract a value for a given key from a string. I made this quick attempt:
char js[] = "some preceding text with\n"
"new lines and spaces\n"
"param_1=123\n"
"param_2=321\n"
"param_3=string\n"
"param_2=321\n";
char* param_name = "param_2";
char *key_s, *val_s;
char buf[32];
key_s = strstr(js, param_name);
if (key_s == NULL)
return 0;
val_s = strchr(key_s, '=');
if (val_s == NULL)
return 0;
sscanf(val_s + 1, "%31s", buf);
printf("'%s'\n", buf);
And it does in fact work (printf gives '321'). But I suppose scanf/sscanf could make this task even easier; I just have not managed to figure out the format string for it.
Is it possible to pass the content of the variable param_name into sscanf so that it is evaluated as part of the format string? In other words, I need to tell sscanf that in this case it should look for the pattern param_2=%s (param_name in fact comes from a function argument).
Not directly, no.
In practice, there's of course nothing stopping you from building the format string for sscanf() at runtime, with e.g. snprintf().
Something like:
void print_value(const char **js, size_t num_js, const char *key)
{
char tmp[32], value[32];
snprintf(tmp, sizeof tmp, "%s=%%31s", key);
for(size_t i = 0; i < num_js; ++i)
{
if(sscanf(js[i], tmp, value) == 1)
{
printf("found '%s'\n", value);
break;
}
}
}
OP has a good first step:
char *key_s = strstr(js, param_name);
if (key_s == NULL)
return 0;
The rest may be simplified to
if (sscanf(&key_s[strlen(param_name)], "=%31s", buf) != 1) { /* also catches EOF, e.g. "param_2=" at end of input */
return 0;
}
printf("'%s'\n", buf);
Alternatively one could use " =%31s" to allow spaces before =.
OP's approach gets fooled by "param_2 321\n" "param_3=string\n".
Note: a weakness of all answers so far is that they do not parse an empty value.
One issue that bears consideration is the difference between finding a 'key=value' setting in the string for a specific key value (such as param_2 in the question), and finding any 'key=value' setting in the string (with no specific key in mind a priori). The techniques to be used are rather different.
Another issue that has not self-evidently been considered is the possibility that you're looking for a key param_2 but the string also contains param_22=xyz and t_param_2=abc. The simple-minded approaches using strstr() to hunt for param_2 will pick up either of those alternatives.
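If strstr() is to be kept, the partial matches can be filtered out by checking the characters around each hit. The following is a rough sketch, not the approach used in the code further down; the names is_key_char and find_value_for_key are made up for this example, and key characters are assumed to be letters, digits and underscores:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if c may be part of a key name. */
static int is_key_char(int c)
{
    return isalnum(c) || c == '_';
}

/*
 * Returns a pointer to the value that follows "key=", where key is
 * matched as a whole word, or NULL if there is no such setting.
 */
const char *find_value_for_key(const char *str, const char *key)
{
    size_t key_len = strlen(key);
    const char *p = str;

    assert(str != NULL);
    if (key_len == 0) return NULL;

    while ((p = strstr(p, key)) != NULL) {
        int starts_word = (p == str) || !is_key_char((unsigned char)p[-1]);
        if (starts_word && p[key_len] == '=')
            return p + key_len + 1;
        p++; /* partial match: keep searching after this position */
    }
    return NULL;
}
```

This rejects param_22=xyz and t_param_2=abc when looking for param_2, but it would still accept a key that appears inside another value, as in param_4=param_2=confusion.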
In the sample data, there is a collection of characters that are not in the 'key=value' format and must be skipped before any of the 'key=value' parts. In the general case, we should assume that such data appears before, in between, and after the 'key=value' pairs. It appears that the values do not need to support complications such as quoted strings and metacharacters, and that each value is delimited by white space. There is no comment convention visible.
Here's some workable code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
enum { MAX_KEY_LEN = 31 };
enum { MAX_VAL_LEN = 63 };
int find_any_key_value(const char *str, char *key, char *value);
int find_key_value(const char *str, const char *key, char *value);
int find_any_key_value(const char *str, char *key, char *value)
{
char junk[256];
const char *search = str;
while (*search != '\0')
{
int offset;
if (sscanf(search, " %31[a-zA-Z_0-9]=%63s%n", key, value, &offset) == 2)
return(search + offset - str);
int rc;
if ((rc = sscanf(search, "%255s%n", junk, &offset)) != 1)
return EOF;
search += offset;
}
return EOF;
}
int find_key_value(const char *str, const char *key, char *value)
{
char found[MAX_KEY_LEN + 1];
int offset;
const char *search = str;
while ((offset = find_any_key_value(search, found, value)) > 0)
{
if (strcmp(found, key) == 0)
return(search + offset - str);
search += offset;
}
return offset;
}
int main(void)
{
char js[] = "some preceding text with\n"
"new lines and spaces\n"
"param_1=123\n"
"param_2=321\n"
"param_3=string\n"
"param_4=param_2=confusion\n"
"m= x\n"
"param_2=987\n";
const char p2_key[] = "param_2";
int offset;
const char *str;
char key[MAX_KEY_LEN + 1];
char value[MAX_VAL_LEN + 1];
printf("String being scanned is:\n[[%s]]\n", js);
str = js;
while ((offset = find_any_key_value(str, key, value)) > 0)
{
printf("Any found key = [%s] value = [%s]\n", key, value);
str += offset;
}
str = js;
while ((offset = find_key_value(str, p2_key, value)) > 0)
{
printf("Found key %s with value = [%s]\n", p2_key, value);
str += offset;
}
return 0;
}
Sample output:
$ ./so24490410
String being scanned is:
[[some preceding text with
new lines and spaces
param_1=123
param_2=321
param_3=string
param_4=param_2=confusion
m= x
param_2=987
]]
Any found key = [param_1] value = [123]
Any found key = [param_2] value = [321]
Any found key = [param_3] value = [string]
Any found key = [param_4] value = [param_2=confusion]
Any found key = [m] value = [x]
Any found key = [param_2] value = [987]
Found key param_2 with value = [321]
Found key param_2 with value = [987]
$
If you need to handle different key or value lengths, you need to adjust the format strings as well as the enumerations. If you pass the size of the key buffer and the size of the value buffer to the functions, then you need to use snprintf() to create the format strings used by sscanf(). There is an outside chance that you might have a single 'word' of 255 characters followed immediately by the target 'key=value' string. The chances are ridiculously small, but you might decide you need to worry about that (it prevents this code from being bomb-proof).
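As a sketch of the snprintf() idea, the function below builds the same format as find_any_key_value() from buffer sizes supplied by the caller. The name scan_key_value and its signature are hypothetical, and it only tries to match a pair at the start of the string:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical variant of the key=value scan that takes buffer sizes.
 * Returns the number of bytes consumed on success, or -1.
 */
int scan_key_value(const char *str, char *key, size_t key_size,
                   char *value, size_t value_size)
{
    char format[64];
    int offset = 0;
    int n;

    assert(str != NULL && key != NULL && value != NULL);
    if (key_size < 2 || value_size < 2) return -1;

    /* field widths exclude the terminating null byte */
    n = snprintf(format, sizeof format, " %%%zu[a-zA-Z_0-9]=%%%zus%%n",
                 key_size - 1, value_size - 1);
    if (n < 0 || (size_t)n >= sizeof format) return -1;

    if (sscanf(str, format, key, value, &offset) != 2) return -1;
    return offset;
}
```

For key_size 32 and value_size 64 the generated format is " %31[a-zA-Z_0-9]=%63s%n", the same string that is hard-coded above.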