C - Char array SEEMS to copy, but only within loop scope - c

Right now, I'm attempting to familiarize myself with C by writing a function which, given a string, will replace all instances of a target substring with a new substring. However, I've run into a problem with a reallocation of a char* array. To my eyes, it seems as though I'm able to successfully reallocate the array string to a desired new size at the end of the main loop, then perform a strcpy to fill it with an updated string. However, it fails for the following scenario:
Original input for string: "use the restroom. Then I need"
Target to replace: "the" (case insensitive)
Desired replacement value: "th'"
At the end of the loop, the line printf("result: %s\n ",string); prints out the correct phrase "use th' restroom. Then I need". However, string seems to then reset itself: the call to strcasestr in the while() statement is successful, the line at the beginning of the loop printf("string: %s \n",string); prints the original input string, and the loop continues indefinitely.
Any ideas would be much appreciated (and I apologize in advance for my flailing debug printf statements). Thanks!
The code for the function is as follows:
int replaceSubstring(char *string, int strLen, char*oldSubstring,
int oldSublen, char*newSubstring, int newSublen )
{
printf("Starting replace\n");
char* strLoc;
while((strLoc = strcasestr(string, oldSubstring)) != NULL )
{
printf("string: %s \n",string);
printf("%d",newSublen);
char *newBuf = (char *) malloc((size_t)(strLen +
(newSublen - oldSublen)));
printf("got newbuf\n");
int stringIndex = 0;
int newBufIndex = 0;
char c;
while(true)
{
if(stringIndex > 500)
break;
if(&string[stringIndex] == strLoc)
{
int j;
for(j=0; j < newSublen; j++)
{
printf("new index: %d %c --> %c\n",
j+newBufIndex, newBuf[newBufIndex+j], newSubstring[j]);
newBuf[newBufIndex+j] = newSubstring[j];
}
stringIndex += oldSublen;
newBufIndex += newSublen;
}
else
{
printf("old index: %d %c --> %c\n", stringIndex,
newBuf[newBufIndex], string[stringIndex]);
newBuf[newBufIndex] = string[stringIndex];
if(string[stringIndex] == '\0')
break;
newBufIndex++;
stringIndex++;
}
}
int length = (size_t)(strLen + (newSublen - oldSublen));
string = (char*)realloc(string,
(size_t)(strLen + (newSublen - oldSublen)));
strcpy(string, newBuf);
printf("result: %s\n ",string);
free(newBuf);
}
printf("end result: %s ",string);
}

At first the task should be clarified regarding desired behavior and interface.
The topic "Char array..." is not clear.
You provide strLen, oldSublen newSublen, so it looks that you indeed want to work just with bulk memory buffers with given length.
However, you use strcasestr, strcpy and string[stringIndex] == '\0' and also mention printf("result: %s\n ",string);.
So I assume that you want to work with "null terminated strings" that can be passed by the caller as string literals: "abc".
It is not needed to pass all those lengths to the function.
It looks that you are trying to implement recursive string replacement. After each replacement you start from the beginning.
Let's consider more complicated sets of parameters, for example, replace aba by ab in abaaba.
Case 1: single pass through input stream
Each of both old substrings can be replaced: "abaaba" => "abab"
That is how the standard sed string replacement works:
> echo "abaaba" | sed 's/aba/ab/g'
abab
Case 2: recursive replacement taking into account possible overlapping
The first replacement: "abaaba" => "ababa"
The second replacement in already replaced result: "ababa" => "abba"
Note that this case is not safe, for example replace "loop" by "loop loop". It is an infinite loop.
Suppose we want to implement a function that takes null terminated strings and the replacement is done in one pass as with sed.
In general the replacement cannot be done in place of input string (in the same memory).
Note that realloc may allocate new memory block with new address, so you should return that address to the caller.
For implementation simplicity it is possible to calculate required space for result before memory allocation (Case 1 implementation). So reallocation is not needed:
#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
char* replaceSubstring(const char* string, const char* oldSubstring,
const char* newSubstring)
{
size_t strLen = strlen(string);
size_t oldSublen = strlen(oldSubstring);
size_t newSublen = strlen(newSubstring);
const char* strLoc = string;
size_t replacements = 0;
/* count number of replacements */
while ((strLoc = strcasestr(strLoc, oldSubstring)))
{
strLoc += oldSublen;
++replacements;
}
/* result size: initial size + replacement diff + sizeof('\0') */
size_t result_size = strLen + (newSublen - oldSublen) * replacements + 1;
char* result = malloc(result_size);
if (!result)
return NULL;
char* resCurrent = result;
const char* strCurrent = string;
strLoc = string;
while ((strLoc = strcasestr(strLoc, oldSubstring)))
{
memcpy(resCurrent, strCurrent, strLoc - strCurrent);
resCurrent += strLoc - strCurrent;
memcpy(resCurrent, newSubstring, newSublen);
resCurrent += newSublen;
strLoc += oldSublen;
strCurrent = strLoc;
}
strcpy(resCurrent, strCurrent);
return result;
}
int main()
{
char* res;
res = replaceSubstring("use the restroom. Then I need", "the", "th");
printf("%s\n", res);
free(res);
res = replaceSubstring("abaaba", "aba", "ab");
printf("%s\n", res);
free(res);
return 0;
}

Related

Cannot access empty string from array of strings in C

I'm using an array of strings in C to hold arguments given to a custom shell. I initialize the array of buffers using:
char *args[MAX_CHAR];
Once I parse the arguments, I send them to the following function to determine the type of IO redirection if there are any (this is just the first of 3 functions to check for redirection and it only checks for STDIN redirection).
int parseInputFile(char **args, char *inputFilePath) {
char *inputSymbol = "<";
int isFound = 0;
for (int i = 0; i < MAX_ARG; i++) {
if (strlen(args[i]) == 0) {
isFound = 0;
break;
}
if ((strcmp(args[i], inputSymbol)) == 0) {
strcpy(inputFilePath, args[i+1]);
isFound = 1;
break;
}
}
return isFound;
}
Once I compile and run the shell, it crashes with a SIGSEGV. Using GDB I determined that the shell is crashing on the following line:
if (strlen(args[i]) == 0) {
This is because the address of arg[i] (the first empty string after the parsed commands) is inaccessible. Here is the error from GDB and all relevant variables:
(gdb) next
359 if (strlen(args[i]) == 0) {
(gdb) p args[0]
$1 = 0x7fffffffe570 "echo"
(gdb) p args[1]
$2 = 0x7fffffffe575 "test"
(gdb) p args[2]
$3 = 0x0
(gdb) p i
$4 = 2
(gdb) next
Program received signal SIGSEGV, Segmentation fault.
parseInputFile (args=0x7fffffffd570, inputFilePath=0x7fffffffd240 "") at shell.c:359
359 if (strlen(args[i]) == 0) {
I believe that the p args[2] returning $3 = 0x0 means that because the index has yet to be written to, it is mapped to address 0x0 which is out of the bounds of execution. Although I can't figure out why this is because it was declared as a buffer. Any suggestions on how to solve this problem?
EDIT: Per Kaylum's comment, here is a minimal reproducible example
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<unistd.h>
#include<sys/types.h>
#include<sys/wait.h>
#include <sys/stat.h>
#include<readline/readline.h>
#include<readline/history.h>
#include <fcntl.h>
// Defined values
#define MAX_CHAR 256
#define MAX_ARG 64
#define clear() printf("\033[H\033[J") // Clear window
#define DEFAULT_PROMPT_SUFFIX "> "
char PROMPT[MAX_CHAR], SPATH[1024];
int parseInputFile(char **args, char *inputFilePath) {
char *inputSymbol = "<";
int isFound = 0;
for (int i = 0; i < MAX_ARG; i++) {
if (strlen(args[i]) == 0) {
isFound = 0;
break;
}
if ((strcmp(args[i], inputSymbol)) == 0) {
strcpy(inputFilePath, args[i+1]);
isFound = 1;
break;
}
}
return isFound;
}
int ioRedirectHandler(char **args) {
char inputFilePath[MAX_CHAR] = "";
// Check if any redirects exist
if (parseInputFile(args, inputFilePath)) {
return 1;
} else {
return 0;
}
}
void parseArgs(char *cmd, char **cmdArgs) {
int na;
// Separate each argument of a command to a separate string
for (na = 0; na < MAX_ARG; na++) {
cmdArgs[na] = strsep(&cmd, " ");
if (cmdArgs[na] == NULL) {
break;
}
if (strlen(cmdArgs[na]) == 0) {
na--;
}
}
}
int processInput(char* input, char **args, char **pipedArgs) {
// Parse the single command and args
parseArgs(input, args);
return 0;
}
int getInput(char *input) {
char *buf, loc_prompt[MAX_CHAR] = "\n";
strcat(loc_prompt, PROMPT);
buf = readline(loc_prompt);
if (strlen(buf) != 0) {
add_history(buf);
strcpy(input, buf);
return 0;
} else {
return 1;
}
}
void init() {
char *uname;
clear();
uname = getenv("USER");
printf("\n\n \t\tWelcome to Student Shell, %s! \n\n", uname);
// Initialize the prompt
snprintf(PROMPT, MAX_CHAR, "%s%s", uname, DEFAULT_PROMPT_SUFFIX);
}
int main() {
char input[MAX_CHAR];
char *args[MAX_CHAR], *pipedArgs[MAX_CHAR];
int isPiped = 0, isIORedir = 0;
init();
while(1) {
// Get the user input
if (getInput(input)) {
continue;
}
isPiped = processInput(input, args, pipedArgs);
isIORedir = ioRedirectHandler(args);
}
return 0;
}
Note: If I forgot to include any important information, please let me know and I can get it updated.
When you write
char *args[MAX_CHAR];
you allocate room for MAX_CHAR pointers to char. You do not initialise the array. If it is a global variable, you will have initialised all the pointers to NULL, but you do it in a function, so the elements in the array can point anywhere. You should not dereference them before you have set the pointers to point at something you are allowed to access.
You also do this, though, in parseArgs(), where you do this:
cmdArgs[na] = strsep(&cmd, " ");
There are two potential issues here, but let's deal with the one you hit first. When strsep() is through the tokens you are splitting, it returns NULL. You test for that to get out of parseArgs() so you already know this. However, where your program crashes you seem to have forgotten this again. You call strlen() on a NULL pointer, and that is a no-no.
There is a difference between NULL and the empty string. An empty string is a pointer to a buffer that has the zero char first; the string "" is a pointer to a location that holds the character '\0'. The NULL pointer is a special value for pointers, often address zero, that means that the pointer doesn't point anywhere. Obviously, the NULL pointer cannot point to an empty string. You need to check if an argument is NULL, not if it is the empty string.
If you want to check both for NULL and the empty string, you could do something like
if (!args[i] || strlen(args[i]) == 0) {
If args[i] is NULL then !args[i] is true, so you will enter the if body if you have NULL or if you have a pointer to an empty string.
(You could also check the empty string with !(*args[i]); *args[i] is the first character that args[i] points at. So *args[i] is zero if you have the empty string; zero is interpreted as false, so !(*args[i]) is true if and only if args[i] is the empty string. Not that this is more readable, but it shows again the difference between empty strings and NULL).
I mentioned another issue with the parsed arguments. Whether it is a problem or not depends on the application. But when you parse a string with strsep(), you get pointers into the parsed string. You have to be careful not to free that string (it is input in your main() function) or to modify it after you have parsed the string. If you change the string, you have changed what all the parsed strings look at. You do not do this in your program, so it isn't a problem here, but it is worth keeping in mind. If you want your parsed arguments to survive longer than they do now, after the next command is passed, you need to copy them. The next command that is passed will change them as it is now.
In main
char input[MAX_CHAR];
char *args[MAX_CHAR], *pipedArgs[MAX_CHAR];
are all uninitialized. They contain indeterminate values. This could be a potential source of bugs, but is not the reason here, as
getInput modifies the contents of input to be a valid string before any reads occur.
pipedArgs is unused, so raises no issues (yet).
args is modified by parseArgs to (possibly!) contain a NULL sentinel value, without any indeterminate pointers being read first.
Firstly, in parseArgs it is possible to completely fill args without setting the NULL sentinel value that other parts of the program should rely on.
Looking deeper, in parseInputFile the following
if (strlen(args[i]) == 0)
contradicts the limits imposed by parseArgs that disallows empty strings in the array. More importantly, args[i] may be the sentinel NULL value, and strlen expects a non-NULL pointer to a valid string.
This termination condition should simply check if args[i] is NULL.
With
strcpy(inputFilePath, args[i+1]);
args[i+1] might also be the NULL sentinel value, and strcpy also expects non-NULL pointers to valid strings. You can see this in action when inputSymbol is a match for the final token in the array.
args[i+1] may also evaluate as args[MAX_ARGS], which would be out of bounds.
Additionally, inputFilePath has a string length limit of MAX_CHAR - 1, and args[i+1] is (possibly!) a dynamically allocated string whose length might exceed this.
Some edge cases found in getInput:
Both arguments to
strcat(loc_prompt, PROMPT);
are of the size MAX_CHAR. Since loc_prompt has a length of 1. If PROMPT has the length MAX_CHAR - 1, the resulting string will have the length MAX_CHAR. This would leave no room for the NUL terminating byte.
readline can return NULL in some situations, so
buf = readline(loc_prompt);
if (strlen(buf) != 0) {
can again pass the NULL pointer to strlen.
A similar issue as before, on success readline returns a string of dynamic length, and
strcpy(input, buf);
can cause a buffer overflow by attempting to copy a string greater in length than MAX_CHAR - 1.
buf is a pointer to data allocated by malloc. It's unclear what add_history does, but this pointer must eventually be passed to free.
Some considerations.
Firstly, it is a good habit to initialize your data, even if it might not matter.
Secondly, using constants (#define MAX_CHAR 256) might help to reduce magic numbers, but they can lead you to design your program too rigidly if used in the same way.
Consider building your functions to accept a limit as an argument, and return a length. This allows you to more strictly track the sizes of your data, and prevents you from always designing around the maximum potential case.
A slightly contrived example of designing like this. We can see that find does not have to concern itself with possibly checking MAX_ARGS elements, as it is told precisely how long the list of valid elements is.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_ARGS 100
char *get_input(char *dest, size_t sz, const char *display) {
char *res;
if (display)
printf("%s", display);
if ((res = fgets(dest, sz, stdin)))
dest[strcspn(dest, "\n")] = '\0';
return res;
}
size_t find(char **list, size_t length, const char *str) {
for (size_t i = 0; i < length; i++)
if (strcmp(list[i], str) == 0)
return i;
return length;
}
size_t split(char **list, size_t limit, char *source, const char *delim) {
size_t length = 0;
char *token;
while (length < limit && (token = strsep(&source, delim)))
if (*token)
list[length++] = token;
return length;
}
int main(void) {
char input[512] = { 0 };
char *args[MAX_ARGS] = { 0 };
puts("Welcome to the shell.");
while (1) {
if (get_input(input, sizeof input, "$ ")) {
size_t argl = split(args, MAX_ARGS, input, " ");
size_t redirection = find(args, argl, "<");
puts("Command parts:");
for (size_t i = 0; i < redirection; i++)
printf("%zu: %s\n", i, args[i]);
puts("Input files:");
if (redirection == argl)
puts("[[NONE]]");
else for (size_t i = redirection + 1; i < argl; i++)
printf("%zu: %s\n", i, args[i]);
}
}
}

Replace underscore with white spaces and make the first letter of the name and the surname to upper case

i want to replace _ (underscore) with white spaces and make the first letter of the name and the surname to upper case while printing the nameList in searchKeyword method.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void searchKeyword(const char * nameList[], int n, const char keyword[])
{
int i,name=0;
char *str;
const char s[2] = " " ;
for(i=0;i<n;i++)
{
char *str = (char *) malloc((strlen(nameList[0])+1)*sizeof(char));
strcpy(str,nameList[i]);
strtok(str,"_");
if(strcmp(keyword,strtok(NULL,"_"))==0) // argument NULL will start string
{ // from last point of previous string
name++;
if(nameList[i] == '_')
strcpy(nameList[i],s);
//nameList[i] = ' ';
printf("%s\n",nameList[i]);
}
}
if(name==0)
{
printf("No such keyword found\n");
}
free(str); //deallocating space
}
int main()
{
char p1[] = "zoe_bale";
char p2[] = "sam_rodriguez";
char p3[] = "jack_alonso";
char p4[] = "david_studi";
char p5[] = "denzel_feldman";
char p6[] = "james_bale";
char p7[] = "james_willis";
char p8[] = "michael_james";
char p9[] = "dustin_bale";
const char * nameList[9] = {p1, p2, p3, p4, p5, p6, p7, p8, p9};
char keyword[100];
printf("Enter a keyword: ");
scanf("%s", keyword);
printf("\n");
searchKeyword(nameList, 9, keyword);
printf("\n");
for (int i = 0; i < 9; i++)
printf("%s\n",nameList[i]);
return 0;
}
Search through the strings and print the ones whose surname part is equal to keyword.
As shown in the example runs below, the strings are printed in “Name Surname” format (the first letters are capitalized).
Output should be like this:
Enter a keyword: james
Michael James
zoe_bale
sam_rodriguez
jack_alonso
david_studi
denzel_feldman
james_bale
james_willis
michael_james
dustin_bale
There is no reason to dynamically allocate storage for your name and surname. Looking at your input, neither will exceed 9-characters, so simply using an array for each of 64-chars provides 6X the storage required (if you are unsure, double that to 128-chars and have 1200% additional space). That avoids the comparatively expensive calls to malloc.
To check whether keyword exists in nameList[i], you don't need to separate the values first and then compare. Simply use strstr (nameList[i], keyword) to determine if keyword is contained in nameList[i]. If you then want to match only the name or surname you can compare again after they are separated. (up to you)
To parse the names from the nameList[i] string, all you need is a single pointer to locate the '_' character. A simple call to strchr() will do and it does not modify nameList[i] so there is no need to duplicate.
After using strchr() to locate the '_' character, simply memcpy() from the start of nameList[i] to your pointer to your name array, increment the pointer and then strcpy() from p to surname. Now you have separated name and surname, simply call toupper() on the first character of each and then output the names separate by a space, e.g.
...
#include <ctype.h>
#define NLEN 64
void searchKeyword (const char *nameList[], int n, const char keyword[])
{
for (int i = 0; i < n; i++) { /* loop over each name in list */
if (strstr (nameList[i], keyword)) { /* does name contain keyword? */
char name[NLEN], surname[NLEN]; /* storage for name, surname */
const char *p = nameList[i]; /* pointer to parse nameList[i] */
if ((p = strchr(p, '_'))) { /* find '_' in nameList[i] */
/* copy first-name to name */
memcpy (name, nameList[i], p - nameList[i]);
name[p++ - nameList[i]] = 0; /* nul-terminate first name */
*name = toupper (*name); /* convert 1st char to uppwer */
/* copy last name to surname */
strcpy (surname, p);
*surname = toupper (*surname); /* convert 1st char to upper */
printf ("%s %s\n", name, surname); /* output "Name Surname" */
}
}
}
}
Example Use/Output
Used with the remainder of your code, searching for "james" locates those names containing "james" and provides what looks like the output you requested, e.g.
$ ./bin/keyword_surname
Enter a keyword: james
James Bale
James Willis
Michael James
zoe_bale
sam_rodriguez
jack_alonso
david_studi
denzel_feldman
james_bale
james_willis
michael_james
dustin_bale
(note: to match only the name or surname add an additional strcmp before the call to printf to determine which you want to output)
Notes On Your Existing Code
Additional notes continuing from the comments on your existing code,
char *str = (char *) malloc((strlen(nameList[0])+1)*sizeof(char));
should simply be
str = malloc (strlen (nameList[i]) + 1);
You have previously declared char *str; so the declaration before your call to malloc() shadows your previous declaration. If you are using gcc/clang, you can add -Wshadow to your compile string to ensure you are warned of shadowed variables. (they can have dire consequences in other circumstances)
Next, sizeof (char) is always 1 and should be omitted from your size calculation. There is no need to cast the return of malloc() in C. See: Do I cast the result of malloc?
Your comparison if (nameList[i] == '_') is a comparison between a pointer and integer and will not work. Your compiler should be issuing a diagnostic telling you that is incorrect (do not ignore compiler warnings -- do not accept code until it compiles without warning)
Look things over and let me know if you have further questions.
that worked for me and has no memory leaks.
void searchKeyword(const char * nameList[], int n, const char keyword[])
{
int found = 0;
const char delim = '_';
for (int i = 0; i < n; i++) {
const char *fst = nameList[i];
for (const char *tmp = fst; *tmp != '\0'; tmp++) {
if (*tmp == delim) {
const char *snd = tmp + 1;
int fst_length = (snd - fst) / sizeof(char) - 1;
int snd_length = strlen(fst) - fst_length - 1;
if (strncmp(fst, keyword, fst_length) == 0 ||
strncmp(snd, keyword, snd_length) == 0) {
found = 1;
printf("%c%.*s %c%s\n",
fst[0]-32, fst_length-1, fst+1,
snd[0]-32, snd+1);
}
break;
}
}
}
if (!found)
puts("No such keyword found");
}
hopefully it's fine for you too, although I use string.h-functions very rarely.

Extracting a string between two similar (or different) strings in C as fast as possible

I made a program in C that can find two similar or different strings and extract the string between them. This type of program has so many uses, and generally when you use such a program, you have a lot of info, so it needs to be fast. I would like tips on how to make this program as fast and efficient as possible.
I am looking for suggestions that won't make me resort to heavy libraries (such as regex).
The code must:
be able to extract a string between two similar or different strings
find the 1st occurrence of string1
find the 1st occurrence of string2 which occurs AFTER string1
extract the string between string1 and string2
be able to use string arguments of any size
be foolproof to human error and return NULL if such occurs (example, string1 exceeds entire text string length. don't crash in an element error, but gracefully return NULL)
focus on speed and efficiency
Below is my code. I am quite new to C, coming from C++, so I could probably use a few suggestions, especially regarding efficient/proper use of the 'malloc' command:
fast_strbetween.c:
/*
Compile with:
gcc -Wall -O3 fast_strbetween.c -o fast_strbetween
*/
#include <stdio.h> // printf
#include <stdlib.h> // malloc
// inline function if it pleases the compiler gods
inline size_t fast_strlen(char *str)
{
int i; // Cannot return 'i' if inside for loop
for(i = 0; str[i] != '\0'; ++i);
return i;
}
char *fast_strbetween(char *str, char *str1, char *str2)
{
// size_t segfaults when incorrect length strings are entered (due to going below 0), so use int instead for increased robustness
int str0len = fast_strlen(str);
int str1len = fast_strlen(str1);
int str1pos = 0;
int charsfound = 0;
// Find str1
do {
charsfound = 0;
while (str1[charsfound] == str[str1pos + charsfound])
++charsfound;
} while (++str1pos < str0len - str1len && charsfound < str1len);
// '++str1pos' increments past by 1: needs to be set back by one
--str1pos;
// Whole string not found or logical impossibilty
if (charsfound < str1len)
return NULL;
/* Start searching 2 characters after last character found in str1. This will ensure that there will be space, and logical possibility, for the extracted text to exist or not, and allow immediate bail if the latter case; str1 cannot possibly have anything between it if str2 is right next to it!
Example:
str = 'aa'
str1 = 'a'
str2 = 'a'
returned = '' (should be NULL)
Without such preventative, str1 and str2 would would be found and '' would be returned, not NULL. This also saves 1 do/while loop, one check pertaining to returning null, and two additional calculations:
Example, if you didn't add +1 str2pos, you would need to change the code to:
if (charsfound < str2len || str2pos - str1pos - str1len < 1)
return NULL;
It also allows for text to be found between three similar strings—what??? I can feel my brain going fuzzy!
Let this example explain:
str = 'aaa'
str1 = 'a'
str2 = 'a'
result = '' (should be 'a')
Without the aforementioned preventative, the returned string is '', not 'a'; the program takes the first 'a' for str1 and the second 'a' for str2, and tries to return what is between them (nothing).
*/
int str2pos = str1pos + str1len + 1; // the '1' added to str2pos
int str2len = fast_strlen(str2);
// Find str2
do {
charsfound = 0;
while (str2[charsfound] == str[str2pos + charsfound])
++charsfound;
} while (++str2pos < str0len - str2len + 1 && charsfound < str2len);
// Deincrement due to '++str2pos' over-increment
--str2pos;
if (charsfound < str2len)
return NULL;
// Only allocate what is needed
char *strbetween = (char *)malloc(sizeof(char) * str2pos - str1pos - str1len);
unsigned int tmp = 0;
for (unsigned int i = str1pos + str1len; i < str2pos; i++)
strbetween[tmp++] = str[i];
return strbetween;
}
int main() {
char str[30] = { "abaabbbaaaabbabbbaaabbb" };
char str1[10] = { "aaa" };
char str2[10] = { "bbb" };
//Result should be: 'abba'
printf("The string between is: \'%s\'\n", fast_strbetween(str, str1, str2));
// free malloc as we go
for (int i = 10000000; --i;)
free(fast_strbetween(str, str1, str2));
return 0;
}
In order to have some way of measuring progress, I have already timed the code above (extracting a small string 10000000 times):
$ time fast_strbetween
The string between is: 'abba'
0m11.09s real 0m11.09s user 0m00.00s system
Process used 99.3 - 100% CPU according to 'top' command (Linux).
Memory used while running: 3.7Mb
Executable size: 8336 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
If anyone would like to offer code, tips, pointers... I would appreciate it. I will also implement the changes and give a timed result for your troubles.
Oh, and one thing that I learned is to always de-allocate malloc; I ran the code above (with extra loops), just before posting this. My computer's ram filled up, and the computer froze. Luckily, Stack made a backup draft! Lesson learned!
* EDIT *
Here is the revised code using chqrlie's advice as best I could. Added extra checks for end of string, which ended up costing about a second of time with the tested phrase but can now bail very fast if the first string is not found. Using null or illogical strings should not result in error, hopefully. Lots of notes int the code, where they can be better understood. If I've left anything thing out or done something incorrectly, please let me know guys; it is not intentional.
fast_strbetween2.c:
/*
Compile with:
gcc -Wall -O3 fast_strbetween2.c -o fast_strbetween2
Corrections and additions courtesy of:
https://stackoverflow.com/questions/55308295/extracting-a-string-between-two-similar-or-different-strings-in-c-as-fast-as-p
*/
#include<stdio.h> // printf
#include<stdlib.h> // malloc, free
// Strings now set to 'const'
char * fast_strbetween(const char *str, const char *str1, const char *str2)
{
// string size will now be calculated by the characters picked up
size_t str1pos = 0;
size_t str1chars;
// Find str1
do{
str1chars = 0;
// Will the do/while str1 check for '\0' suffice?
// I haven't seen any issues yet, but not sure.
while(str1[str1chars] == str[str1pos + str1chars] && str1[str1chars] != '\0')
{
//printf("Found str1 char: %i num: %i pos: %i\n", str1[str1chars], str1chars + 1, str1pos);
++str1chars;
}
// Incrementing whilst not in conditional expression tested faster
++str1pos;
/* There are two checks for "str1[str1chars] != '\0'". Trying to find
another efficient way to do it in one. */
}while(str[str1pos] != '\0' && str1[str1chars] != '\0');
--str1pos;
//For testing:
//printf("str1pos: %i str1chars: %i\n", str1pos, str1chars);
// exit if no chars were found or if didn't reach end of str1
if(!str1chars || str1[str1chars] != '\0')
{
//printf("Bailing from str1 result\n");
return '\0';
}
/* Got rid of the '+1' code which didn't allow for '' returns.
I agree with your logic of <tag></tag> returning ''. */
size_t str2pos = str1pos + str1chars;
size_t str2chars;
//printf("Starting pos for str2: %i\n", str1pos + str1chars);
// Find str2
do{
str2chars = 0;
while(str2[str2chars] == str[str2pos + str2chars] && str2[str2chars] != '\0')
{
//printf("Found str2 char: %i num: %i pos: %i \n", str2[str2chars], str2chars + 1, str2pos);
++str2chars;
}
++str2pos;
}while(str[str2pos] != '\0' && str2[str2chars] != '\0');
--str2pos;
//For testing:
//printf("str2pos: %i str2chars: %i\n", str2pos, str2chars);
if(!str2chars || str2[str2chars] != '\0')
{
//printf("Bailing from str2 result!\n");
return '\0';
}
/* Trying to allocate strbetween with malloc. Is this correct? */
char * strbetween = malloc(2);
// Check if malloc succeeded:
if (strbetween == '\0') return '\0';
size_t tmp = 0;
// Grab and store the string between!
for(size_t i = str1pos + str1chars; i < str2pos; ++i)
{
strbetween[tmp] = str[i];
++tmp;
}
return strbetween;
}
int main() {
char str[30] = { "abaabbbaaaabbabbbaaabbb" };
char str1[10] = { "aaa" };
char str2[10] = { "bbb" };
printf("Searching \'%s\' for \'%s\' and \'%s\'\n", str, str1, str2);
printf(" 0123456789\n\n"); // Easily see the elements
printf("The word between is: \'%s\'\n", fast_strbetween(str, str1, str2));
for(int i = 10000000; --i;)
free(fast_strbetween(str, str1, str2));
return 0;
}
** Results **
$ time fast_strbetween2
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
0123456789
The word between is: 'abba'
0m10.93s real 0m10.93s user 0m00.00s system
Process used 99.0 - 100% CPU according to 'top' command (Linux).
Memory used while running: 1.8Mb
Executable size: 8336 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
chqrlie's answer
I understand that this is just some example code that shows proper programming practices. Nonetheless, it can make for a decent control in testing.
Please note that I do not know how to deallocate malloc in your code, so it is NOT a fair test. As a result, ram usage builds up, taking 130Mb+ for the process alone. I was still able to run the test for the full 10000000 loops. I will say that I tried deallocating this code the way I did my code (via bringing the function 'simple_strbetween' down into main and deallocating with 'free(strndup(p, q - p));'), and the results weren't much different from not deallocating.
** simple_strbetween.c **
/*
Compile with:
gcc -Wall -O3 simple_strbetween.c -o simple_strbetween
Courtesy of:
https://stackoverflow.com/questions/55308295/extracting-a-string-between-two-similar-or-different-strings-in-c-as-fast-as-p
*/
#include<string.h>
#include<stdio.h>
char *simple_strbetween(const char *str, const char *str1, const char *str2) {
const char *q;
const char *p = strstr(str, str1);
if (p) {
p += strlen(str1);
q = *str2 ? strstr(p, str2) : p + strlen(p);
if (q)
return strndup(p, q - p);
}
return NULL;
}
int main() {
char str[30] = { "abaabbbaaaabbabbbaaabbb" };
char str1[10] = { "aaa" };
char str2[10] = { "bbb" };
printf("Searching \'%s\' for \'%s\' and \'%s\'\n", str, str1, str2);
printf(" 0123456789\n\n"); // Easily see the elements
printf("The word between is: \'%s\'\n", simple_strbetween(str, str1, str2));
for(int i = 10000000; --i;)
simple_strbetween(str, str1, str2);
return 0;
}
$ time simple_strbetween
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
0123456789
The word between is: 'abba'
0m19.68s real 0m19.34s user 0m00.32s system
Process used 100% CPU according to 'top' command (Linux).
Memory used while running: 130Mb (leak due do my lack of knowledge)
Executable size: 8380 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
Results for above code ran with this alternate strndup:
char *alt_strndup(const char *s, size_t n)
{
size_t i;
char *p;
for (i = 0; i < n && s[i] != '\0'; i++)
continue;
p = malloc(i + 1);
if (p != NULL) {
memcpy(p, s, i);
p[i] = '\0';
}
return p;
}
$ time simple_strbetween
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
0123456789
The word between is: 'abba'
0m20.99s real 0m20.54s user 0m00.44s system
I kindly ask that nobody make judgements on the results until the code is properly ran. I will revise the results as soon as it is figured out.
* Edit *
Was able to decrease the time by over 25% (11.93s vs 8.7s). This was done by using pointers to increment the positions, as opposed to size_t. Collecting the return string whilst checking the last string was likely what caused the biggest change. I feel there is still lots of room for improvement. A big loss comes from having to free malloc. If there is a better way, I'd like to know.
fast_strbetween3.c:
/*
gcc -Wall -O3 fast_strbetween.c -o fast_strbetween
*/
#include<stdio.h> // printf
#include<stdlib.h> // malloc, free
char * fast_strbetween(const char *str, const char *str1, const char *str2)
{
const char *sbegin = &str1[0]; // String beginning
const char *spos;
// Find str1
do{
spos = str;
str1 = sbegin;
while(*spos == *str1 && *str1)
{
++spos;
++str1;
}
++str;
}while(*str1 && *spos);
// Nothing found if spos hasn't advanced
if (spos == str)
return NULL;
char *strbetween = malloc(1);
if (!strbetween)
return '\0';
str = spos;
int i = 0;
//char *p = &strbetween[0]; // Alt. for advancing strbetween (slower)
sbegin = &str2[0]; // Recycle sbegin
// Find str2
do{
str2 = sbegin;
spos = str;
while(*spos == *str2 && *str2)
{
++str2;
++spos;
}
//*p = *str;
//++p;
strbetween[i] = *str;
++str;
++i;
}while(*str2 && *spos);
if (spos == str)
return NULL;
//*--p = '\0';
strbetween[i - 1] = '\0';
return strbetween;
}
int main() {
char s[100] = "abaabbbaaaabbabbbaaabbb";
char s1[100] = "aaa";
char s2[100] = "bbb";
printf("\nString: \'%s\'\n", fast_strbetween(s, s1, s2));
for(int i = 10000000; --i; )
free(fast_strbetween(s, s1, s2));
return 0;
}
String: 'abba'
0m08.70s real 0m08.67s user 0m00.01s system
Process used 99.0 - 100% CPU according to 'top' command (Linux).
Memory used while running: 1.8Mb
Executable size: 8336 bytes
Ran on a Raspberry Pi 3B+ (4 x 1.4Ghz, Arm 6)
* Edit *
This doesn't really count as it does not 'return' a value, and therefore is against my own rules, but it does pass a variable through, which is changed and brought back to main. It runs with 1 library and takes 3.6s. Getting rid of malloc was the key.
/*
gcc -Wall -O3 fast_strbetween.c -o fast_strbetween
*/
#include<stdio.h> // printf
unsigned int fast_strbetween(const char *str, const char *str1, const char *str2, char *strbetween)
{
const char *sbegin = &str1[0]; // String beginning
const char *spos;
// Find str1
do{
spos = str;
str1 = sbegin;
while(*spos == *str1 && *str1)
{
++spos;
++str1;
}
++str;
}while(*str1 && *spos);
// Nothing found if spos hasn't advanced
if (spos == str)
{
strbetween[0] = '\0';
return 0;
}
str = spos;
sbegin = &str2[0]; // Recycle sbegin
// Find str2
do{
str2 = sbegin;
spos = str;
while(*spos == *str2 && *str2)
{
++str2;
++spos;
}
*strbetween = *str;
++strbetween;
++str;
}while(*str2 && *spos);
if (spos == str)
{
strbetween[0] = '\0';
return 0;
}
*--strbetween = '\0';
return 1; // Successful (found text)
}
int main() {
char s[100] = "abaabbbaaaabbabbbaaabbb";
char s1[100] = "aaa";
char s2[100] = "bbb";
char sret[100];
fast_strbetween(s, s1, s2, sret);
printf("String: %s\n", sret);
for(int i = 10000000; --i; )
fast_strbetween(s, s1, s2, sret);
return 0;
}
Your code has multiple problems and is probably not as efficient as it should be:
you use types int and unsigned int for indexes into the strings. These types may be smaller than the range of size_t. You should revise your code to use size_t and avoid mixing signed and unsigned types in comparisons.
your functions' string arguments should be declared as const char * as you do not modify the strings and should be able to pass const strings without a warning.
redefining strlen is a bad idea: your version will be slower than the system's optimized, assembly coded and very likely inlined version.
computing the length of str is unnecessary and potentially costly: both str1 and str2 may appear close to the beginning of str, scanning for the end of str will be wasteful.
the while loop inside the first do / while loop is incorrect: while(str1[charsfound] == str[str1pos + charsfound]) charsfound++; may access characters beyond the end of str and str1 as the loop does not stop at the null terminator. If str1 only appears at the end of str, you have undefined behavior.
if str1 is an empty string, you will find it at the end of str instead of at the beginning.
why do you initialize str2pos as int str2pos = str1pos + str1len + 1;? If str2 immediately follows str1 inside str, an empty string should be allocated and returned. Your comment regarding this case is unreadable, you should break such long lines to fit within a typical screen width such as 80 columns. It is debatable whether strbetween("aa", "a", "a") should return "" or NULL. IMHO it should return an allocated empty string, which would be consistent with the expected behavior on strbetween("<name></name>", "<name>", "</name>") or strbetween("''", "'", "'"). Your specification preventing strbetween from returning an empty string produces a counter-intuitive border case.
the second scanning loop has the same problems as the first.
the line char *strbetween = (char *) malloc(sizeof(char) * str2pos - str1pos - str1len); has multiple problems: no cast is necessary in C, if you insist on specifying the element size sizeof(char), which is 1 by definition, you should parenthesize the number of elements, and last but not least, you must allocate one extra element for the null terminator.
You do not test if malloc() succeeded. If it returns NULL, you will have undefined behavior, whereas you should just return NULL.
the copying loop uses a mix of signed and unsigned types, causing potentially counterintuitive behavior on overflow.
you forget to set the null terminator, which is consistent with the allocation size error, but incorrect.
Before you try and optimize code, you must ensure correctness! Your code is too complicated and has multiple flaws. Optimisation is a moot point.
You should first try a very simple implementation using standard C string functions: searching a string inside another one is performed efficiently by strstr.
Here is a simple implementation using strstr and strndup(), which should be available on your system:
#include <string.h>
char *simple_strbetween(const char *str, const char *str1, const char *str2) {
const char *q;
const char *p = strstr(str, str1);
if (p) {
p += strlen(str1);
q = *str2 ? strstr(p, str2) : p + strlen(p);
if (q)
return strndup(p, q - p);
}
return NULL;
}
strndup() is defined in POSIX and is part of the Extensions to the C Library Part II: Dynamic Allocation Functions, ISO/IEC TR 24731-2:2010. If it is not available on your system, it can be redefined as:
#include <stdlib.h>
#include <string.h>
char *strndup(const char *s, size_t n) {
size_t i;
char *p;
for (i = 0; i < n && s[i] != '\0'; i++)
continue;
p = malloc(i + 1);
if (p != NULL) {
memcpy(p, s, i);
p[i] = '\0';
}
return p;
}
To ensure correctness, write a number of test cases, with border cases such as all combinations of empty strings and identical strings.
Once your have thoroughly your strbetween function, you can write a benchmarking framework to test performance. This is not so easy to get reliable performance figures, as you will experience if you try. Remember to configure your compiler to select the appropriate optimisations, -O3 for example.
Only then can you move to the next step: if you are really restricted from using standard C library functions, you may first recode your versions of strstr and strlen and still use the same method. Test this new version both for correctness and for performance.
The redundant parts are the computation of strlen(str1) which must have been determined by strstr when it finds a match. And the scan in strndup() which is unnecessary since no null byte is present between p and q. If you have time to waste, you can try and remove these redundancies at the expense of readability, risking non conformity. I would be surprised if you get any improvement at all on average over a wide variety of test cases. 20% would be remarkable.

C Get complete words from first 15 characters of string

I have a function that will return either the first 13 characters of a string or the second 13 characters of a string:
char* get_headsign_text(char* string, int position) {
if (position == 1){
char* myString = malloc(13);
strncpy(myString, string, 13);
myString[13] = '\0'; //null terminate destination
return myString;
free(myString);
} else {
char* myString = malloc(13);
string += 13;
strncpy(myString, string, 13);
myString[13] = '\0'; //null terminate destination
return myString;
free(myString);
}
}
I would like to have it so that the function will return only complete words (not cutoff words in the middle).
Example:
If the string is "Hi I'm Christopher"
get_headsign_text(string, 1) = "Hi I'm "
get_headsign_text(string, 2) = "Christopher"
So if the function would have cut within a word, instead it would cut before that last word, and if so, if it is trying to get the second 13 it would include the word that would have been cut.
When taking various edge cases into consideration, the structure of the code needs to change considerably.
For instance:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
inline int min_int(int a, int b) {
return a < b ? a : b;
}
inline int is_word_char(char c) {
return isgraph(c);
}
char* get_headsign_text(char* string, int position) {
int start_index, end_index;
if (position == 1) {
start_index = 0;
} else {
start_index = 13;
}
end_index = min_int(strlen(string) + 1, start_index + 13);
start_index = min_int(start_index, end_index);
int was_word_char = 1;
while(start_index > 0 && (was_word_char = is_word_char(string[start_index]))) {
--start_index;
}
if(!was_word_char) {
++start_index;
}
while(end_index > start_index && is_word_char(string[end_index])) {
--end_index;
}
int myStringLen = end_index - start_index;
char *myString = malloc(myStringLen + 1);
strncpy(myString, string + start_index, myStringLen);
myString[myStringLen] = '\0';
return myString;
}
int main(void) {
char s[] = "Hi, I\'m Christopher";
char *r1 = get_headsign_text(s, 1);
char *r2 = get_headsign_text(s, 2);
printf("<%s>\n<%s>\n", r1, r2);
free(r1);
free(r2);
return 0;
}
That said, there are numerous other problems/concerns with the code snippet you posted:
In the assignment myString[13] = '\0';, you are assigning to memory which you have not allocated. Although you have allocated 13 bytes, myString[13] refers to one byte past the last allocated byte.
Nothing after the return statement gets executed, and the calls to free are never reached.
You shouldn't be returning a block of memory only to free it immediately! It's quite counter-productive to give something to the caller only to take it away. :)
You do not validate the size of the string. Unless you are absolutely certain this will only be called on strings of sufficient length, your function will segfault when, say, position is 2 and your string buffer is only, say, 10 bytes long.
You need to check if your last character is not a blank space '' then it should find for trailing space and cut your string to that index.
Keep track of your spaces using an index variable. If the character count is 13, and the current character is not a space or null terminator, then adjust your character count by subtracting that and the last space index. Save the string, and then continue from the last space index.
You have a multitude of issues.
First issue, you're free'ing after returning the myString - meaning this function will not free the string.
Second issue. You allocate 13 chars, then you set the 13rd char to null. Are you sure this does what you expect?
Third issue - why are you adding 13 to the string pointer? What should that do?
Lastly, you should think about which character is used to separate words - when you've figured which one that is, try to scan for it and cut your string where it is.

Appending a char to a char* in C?

I'm trying to make a quick function that gets a word/argument in a string by its number:
char* arg(char* S, int Num) {
char* Return = "";
int Spaces = 0;
int i = 0;
for (i; i<strlen(S); i++) {
if (S[i] == ' ') {
Spaces++;
}
else if (Spaces == Num) {
//Want to append S[i] to Return here.
}
else if (Spaces > Num) {
return Return;
}
}
printf("%s-\n", Return);
return Return;
}
I can't find a way to put the characters into Return. I have found lots of posts that suggest strcat() or tricks with pointers, but every one segfaults. I've also seen people saying that malloc() should be used, but I'm not sure of how I'd used it in a loop like this.
I will not claim to understand what it is that you're trying to do, but your code has two problems:
You're assigning a read-only string to Return; that string will be in your
binary's data section, which is read-only, and if you try to modify it you will get a segfault.
Your for loop is O(n^2), because strlen() is O(n)
There are several different ways of solving the "how to return a string" problem. You can, for example:
Use malloc() / calloc() to allocate a new string, as has been suggested
Use asprintf(), which is similar but gives you formatting if you need
Pass an output string (and its maximum size) as a parameter to the function
The first two require the calling function to free() the returned value. The third allows the caller to decide how to allocate the string (stack or heap), but requires some sort of contract about the minumum size needed for the output string.
In your code, when the function returns, then Return will be gone as well, so this behavior is undefined. It might work, but you should never rely on it.
Typically in C, you'd want to pass the "return" string as an argument instead, so that you don't have to free it all the time. Both require a local variable on the caller's side, but malloc'ing it will require an additional call to free the allocated memory and is also more expensive than simply passing a pointer to a local variable.
As for appending to the string, just use array notation (keep track of the current char/index) and don't forget to add a null character at the end.
Example:
int arg(char* ptr, char* S, int Num) {
int i, Spaces = 0, cur = 0;
for (i=0; i<strlen(S); i++) {
if (S[i] == ' ') {
Spaces++;
}
else if (Spaces == Num) {
ptr[cur++] = S[i]; // append char
}
else if (Spaces > Num) {
ptr[cur] = '\0'; // insert null char
return 0; // returns 0 on success
}
}
ptr[cur] = '\0'; // insert null char
return (cur > 0 ? 0 : -1); // returns 0 on success, -1 on error
}
Then invoke it like so:
char myArg[50];
if (arg(myArg, "this is an example", 3) == 0) {
printf("arg is %s\n", myArg);
} else {
// arg not found
}
Just make sure you don't overflow ptr (e.g.: by passing its size and adding a check in the function).
There are numbers of ways you could improve your code, but let's just start by making it meet the standard. ;-)
P.S.: Don't malloc unless you need to. And in that case you don't.
char * Return; //by the way horrible name for a variable.
Return = malloc(<some size>);
......
......
*(Return + index) = *(S+i);
You can't assign anything to a string literal such as "".
You may want to use your loop to determine the offsets of the start of the word in your string that you're looking for. Then find its length by continuing through the string until you encounter the end or another space. Then, you can malloc an array of chars with size equal to the size of the offset+1 (For the null terminator.) Finally, copy the substring into this new buffer and return it.
Also, as mentioned above, you may want to remove the strlen call from the loop - most compilers will optimize it out but it is indeed a linear operation for every character in the array, making the loop O(n**2).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *arg(const char *S, unsigned int Num) {
char *Return = "";
const char *top, *p;
unsigned int Spaces = 0;
int i = 0;
Return=(char*)malloc(sizeof(char));
*Return = '\0';
if(S == NULL || *S=='\0') return Return;
p=top=S;
while(Spaces != Num){
if(NULL!=(p=strchr(top, ' '))){
++Spaces;
top=++p;
} else {
break;
}
}
if(Spaces < Num) return Return;
if(NULL!=(p=strchr(top, ' '))){
int len = p - top;
Return=(char*)realloc(Return, sizeof(char)*(len+1));
strncpy(Return, top, len);
Return[len]='\0';
} else {
free(Return);
Return=strdup(top);
}
//printf("%s-\n", Return);
return Return;
}
int main(){
char *word;
word=arg("make a quick function", 2);//quick
printf("\"%s\"\n", word);
free(word);
return 0;
}

Resources