Recursion with C - c

I have the code below to reverse a string recursively, it works when I print the chars after the recursion is finished, but I can not figure out how to assemble the reverse chars into a string and return them reversed to the caller. Anyone have an idea? I don't want to add another parameter to accumulate chars, just this way, this is not homework, I am brushing up on small things since I will be graduating in a year and need to do well on interviews.
char* reversestring5(char* s)
{
int i = 0;
//Not at null terminator
if(*s!=0)
{
//advance the pointer
reversestring5(s+1);
printf("%c\n",*s);
}
}

With a recursive function, it's usually easiest to first figure out how to solve a trivial case (e.g. reversing a string with just a pair of characters) and then see how one might divide up the problem into simple operations culminating with the trivial case. For example, one might do this:
This is the actual recursive function:
char *revrecurse(char *front, char *back)
{
if (front < back) {
char temp = *front;
*front = *back;
*back = temp;
revrecurse(front+1, back-1);
}
return front;
}
This part just uses the recursive function:
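char *reverse(char *str)
{
    size_t len = strlen(str);
    if (len < 2)
        return str; /* nothing to swap; also avoids computing &str[-1] for an empty string */
    return revrecurse(str, &str[len-1]);
}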
Note that this assumes that the pointer is valid and that it points to a NUL-terminated string.
If you're going to actually reverse the characters, you can either provide a pair of pointers and recursively swap letters (which is what this routine does) or copy the characters one at a time into yet another space. That's essentially what your original code is doing: copying one character at a time to stdout, which is a global structure that is not explicitly passed but is being used by your routine. The analog to that approach, but using pointers, might look like this:
#define MAXSTRINGLEN 200
char revbuf[MAXSTRINGLEN];
char *refptr = revbuf;
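char *revstring(char *s)
{
    if (*s != 0)
    {
        revstring(s+1);
        *refptr++ = *s; /* copy non-NUL characters */
        *refptr = '\0'; /* keep the buffer terminated after each copy */
    } else {
        refptr = revbuf; /* the deepest call runs first: restart at the head of the buffer */
        *refptr = '\0'; /* an empty input gives an empty result */
    }
    return revbuf;
}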
In this minor modification to your original code, you can now see the reliance of this approach on global variables revbuf and refptr which were hidden inside stdout in your original code. Obviously this is not even close to optimized -- it's intended solely for explanatory purposes.

"Reversing a string recursively" is a very vague statement of a problem, which allows for many completely different solutions.
Note that a "reasonable" solution should avoid making excessive passes over the string. Any solution that begins with strlen is not really a reasonable one: it is recursive for the sake of being recursive and nothing more. If one resorts to making an additional pass over the string, one no longer really needs a recursive solution at all.
So, let's look for a more sensible single-pass recursive solution. And you almost got it already.
Your code already taught you that the reverse sequence of characters is obtained on the backtracking phase of recursion. That's exactly where you placed your printf. So, the "straightforward" approach would be to take these reversed characters and, instead of printf-ing them, just write them back into the original string starting from the beginning of the string. A naive attempt to do this might look as follows:
void reversestring_impl(char* s, char **dst)
{
if (*s != '\0')
{
reversestring_impl(s + 1, dst);
*(*dst)++ = *s;
}
}
void reversestring5(char* s)
{
char *dst = s;
reversestring_impl(s, &dst);
}
Note that this implementation uses an additional parameter dst, which carries the destination location for writing the next output character. That destination location remains unchanged on the forward pass of the recursion, and gets incremented as we write output characters on the backtracking pass of the recursion.
However, the above code will not work properly, since we are working "in place", i.e. using the same string as input and output at the same time. The beginning of the string will get overwritten prematurely. This will destroy character information that will be needed on later backtracking steps. In order to work around this issue, each nested level of recursion should save its current character locally before the recursive call and use the saved copy after the recursive call:
void reversestring_impl(char* s, char **dst)
{
if (*s != '\0')
{
char c = *s;
reversestring_impl(s + 1, dst);
*(*dst)++ = c;
}
}
void reversestring5(char* s)
{
char *dst = s;
reversestring_impl(s, &dst);
}
int main()
{
char str[] = "123456789";
reversestring5(str);
printf("%s\n", str);
}
The above works as intended.

If you really can't use a helper function and you really can't modify the interface to the function and you really must use recursion, you could do this, horrible though it is:
char *str_reverse(char *str)
{
size_t len = strlen(str);
if (len > 1)
{
char c0 = str[0];
char c1 = str[len-1];
str[len-1] = '\0';
(void)str_reverse(str+1);
str[0] = c1;
str[len-1] = c0;
}
return str;
}
This captures the first and last characters in the string (you could survive without capturing the first), then shortens the string, calls the function on the shortened string, then reinstates the swapped first and last characters. The return value is really of no help; I only kept it to keep the interface unchanged. This is clearest when the recursive call ignores the return value.
Note that this is gruesome for performance because it evaluates strlen() (N/2) times, rather than just once. Given a gigabyte string to reverse, that matters.
I can't think of a good way to write the code without using strlen() or its equivalent. To reverse the string in situ, you have to know where the end is somehow. Since the interface you stipulate does not include the information on where the end is, you have to find the end in the function, somehow. I don't regard strchr(str, '\0') as significantly different from strlen(str), for instance.
If you change the interface to:
void mem_reverse_in_situ(char *start, char *end)
{
if (start < end)
{
char c0 = *start;
*start = *end;
*end = c0;
mem_reverse_in_situ(start+1, end-1);
}
}
Then the reversal code avoids all issues of string length (or memory length) — requiring the calling code to deal with it. The function simply swaps the ends and calls itself on the middle segment. You'd not write this as a recursive function, though; you'd use an iterative solution:
void mem_reverse_in_situ(char *start, char *end)
{
while (start < end)
{
char c0 = *start;
*start++ = *end;
*end-- = c0;
}
}

char* reversestring5(char* s){
size_t len = strlen(s);
char last[2] = {*s};
return (len > 1) ? strcat(memmove(s, reversestring5(s+1), len), last) : s;
}

This is a good question, and the answer involves a technique that apparently few people are familiar with, judging by the other answers. The code below recursively converts the string into a linked list (kept on the stack, so it's quite efficient) that represents the reversal of the string. It then converts the linked list back into a string (which it does iteratively, but the problem statement doesn't say it can't). There's a complaint in the comments that this is "overkill", but any recursive solution will be overkill: recursion is simply not a good way to process an array in reverse. Note, though, that this approach applies to a whole set of problems where values are generated on the fly, rather than being already available in an array, and must then be processed in reverse; backtracking algorithms such as the Eight Queens problem are an example. Since the OP is interested in developing or brushing up on skills, this answer provides extra value, because the technique of creating a linked list on the stack and then consuming it in the termination condition (as it must be, before the memory of the linked list goes out of scope) is apparently not well known.
In response to complaints that this isn't "pure recursive" because of the iterative copy of the list to the string buffer, I've updated it to do it both ways:
#include <stdio.h>
#include <stdlib.h>
typedef struct Cnode Cnode;
struct Cnode
{
char c;
const Cnode* next;
};
static void list_to_string(char* s, const Cnode* np)
{
#ifdef ALL_RECURSIVE
if (np)
{
*s = np->c;
list_to_string(s+1, np->next);
}
else
*s = '\0';
#else
for (; np; np = np->next)
*s++ = np->c;
*s = '\0';
#endif
}
static char* reverse_string_recursive(const char* s, size_t len, const Cnode* np)
{
if (*s)
{
Cnode cn = { *s, np };
return reverse_string_recursive(s+1, len+1, &cn);
}
char* rs = malloc(len+1);
if (rs)
list_to_string(rs, np);
return rs;
}
char* reverse_string(const char* s)
{
return reverse_string_recursive(s, 0, NULL);
}
int main (int argc, char** argv)
{
if (argc > 1)
{
const char* rs = reverse_string(argv[1]);
printf("%s\n", rs? rs : "[malloc failed in reverse_string]");
}
return 0;
}

Here's a "there and back again" [Note 1] in-place reverse which:
doesn't use strlen() and doesn't need to know how long the string is in advance; and
has a maximum recursion depth of half of the string length.
It also never backs up an iterator, so if it were written in C++, it could use a forward iterator. However, that feature is less interesting because it keeps iterators on the stack and requires that you can consistently iterate forward from an iterator, so it can't use input iterators. Still, it does mean that it can be used to in-place reverse values in a singly-linked list, which is possibly slightly interesting.
static void swap(char* lo, char* hi) {
char tmp = *hi;
*hi = *lo;
*lo = tmp;
}
static char* step(char* tortoise, char* hare) {
    if (!hare[0]) return tortoise;     /* even length: hare is on the terminator */
    if (!hare[1]) return tortoise + 1; /* odd length: skip the middle character */
    hare = step(tortoise + 1, hare + 2);
    swap(tortoise, hare);
    return hare + 1;
}
void reverse_in_place(char* str) { step(str, str); }
Note 1: The "there and back again" pattern comes from a paper by Olivier Danvy and Mayer Goldberg, which makes for fun reading. The paper still seems to be online at ftp://ftp.daimi.au.dk/pub/BRICS/pub/RS/05/3/BRICS-RS-05-3.pdf
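The remark above about singly-linked lists can be sketched as follows. This is a minimal illustration, not part of the original answer: the struct node type and the names step_list and reverse_values are hypothetical, and the two base cases test for the end of the list (NULL) where the string version tests for the terminating NUL.

```c
#include <stddef.h>

/* Hypothetical list node, introduced only for this sketch. */
struct node {
    int value;
    struct node *next;
};

/* Walks the tortoise one node and the hare two nodes per call.
   On the way back, each call swaps its tortoise's value with the
   node handed back by the deeper call, then hands back the next one. */
static struct node *step_list(struct node *tortoise, struct node *hare)
{
    if (!hare)
        return tortoise;           /* even length: tortoise starts the second half */
    if (!hare->next)
        return tortoise->next;     /* odd length: skip the middle node */

    struct node *partner = step_list(tortoise->next, hare->next->next);

    int tmp = tortoise->value;
    tortoise->value = partner->value;
    partner->value = tmp;

    return partner->next;
}

/* Reverses the values stored in the list without ever walking backwards. */
void reverse_values(struct node *head)
{
    step_list(head, head);
}
```

As with the string version, the recursion depth is about half the list length, and only next pointers are followed.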

Related

edit a string by removing characters without creating a new string, is it legit or not?

EDIT: I think I've understood how this concept works, this is my code
void delete_duplicate(char* str) {
if (str == NULL) {
exit(1);
}
for (size_t i = 0; str[i] != 0; i++) {
if (str[i] == str[i + 1]) {
str[i] = '\0';
}
}
}
int main(void) {
char str[] = "hhhhhhhheeeeeeeeyyyyyyyyyyy";
delete_duplicate(str);
return 0;
}
the output string is "00...0h0...000e0...000y" (with lots of zeros). if the string is "abbbb", the string becomes "a" and not "ab".
I was thinking about an algorithm for this exercise:
given a string, for example "ssttringg" the output must be "string". the function has to remove one character if and only if the current character and the next character are the same.
the exercise expects a function declared like this:
extern void delete_consecutive(char* str);
this is my algorithm:
loop: if the current character is equal to the next character, increment length by 1, and repeat until zero terminator is reached.
critical part: dynamic memory allocation.
allocate enough memory to store the output. Why did I say this is the critical part? because I have to do it in a void function, therefore I can't create a second string, allocate enough memory with malloc, and then return it at the end. I need to edit the original string instead. I think I can do it by means of a realloc, and by decrementing the size of the string. But I don't know how to do it, and more important: I don't know if it's legit to do it, without losing data. I think I'm overcomplicating the homework, but it's a void function, therefore I can't simply create a new string and copying characters in it, because I can't return it.
if it's not clear, I'll edit the question.
This pattern is called a retention scan (also a reduction scan, depending on your perspective), and is very common in algorithms that dictate discarding characters whilst keeping others, based on some condition. Often, the condition can change, and sometimes even the methods for starting the scan are somewhat altered.
In its simplest form, it looks something like this, as an example: An algorithm used to discard all but numeric (digit) characters:
Start with reader r and writer w pointers at the head of the string.
Iterate over the string from beginning to terminator using r. The r pointer will always be incremented exactly once per iteration.
For each iteration, check to see if the current character at r satisfies the condition for retention (in this case, is the character a digit char?). If it does, write it at w and advance w one slot.
When finished, w will point to the location where your terminator should reside.
#include <ctype.h>
void delete_nondigits(char *s)
{
if (s && *s)
{
char *w = s;
for (char *r = s; *r; ++r)
{
if (isdigit((unsigned char)*r))
*w++ = *r;
}
*w = 0;
}
}
Pretty simple.
Now, the algorithm for in-place consecutive-run compaction is more complicated, but has a highly similar model. Because retention is based on a prior value that was already retained, you need to remember that last-kept value. You could just use *w, but the algorithm is actually easier to understand (and advance w) if you keep it in a separate memo char, as you're about to see:
Start with reader r and writer w pointers as we had before, but starting at the second slot of the string, and a single memo char variable c initialized to the first character of the string.
Iterate over the string using r until the terminator is encountered. The r pointer will always be incremented exactly once per iteration.
For each iteration, check to see if the current character at *r is the same as the memo character c. If it is, do nothing and continue to the next iteration.
Otherwise, if the character at *r is different than the memo character c, save it as the new value for c, write it to *w, and advance w one slot.
When finished, terminate the string by setting a terminator at *w.
The only hitch to this algorithm is supporting a zero-length string on inception, which is easily circumvented by checking that single condition first. I leave actually implementing it as an exercise for you. (hint: only do the above if (s && *s) is true).
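For reference, the steps above can be sketched like this. It is one possible reading of the description, using the delete_consecutive name from the question's declaration; treat it as a sketch rather than the answerer's own code.

```c
/* Collapses each run of identical consecutive characters to one character,
   in place. Assumes str is NULL or points to a modifiable,
   NUL-terminated string. */
void delete_consecutive(char *str)
{
    if (str && *str) {
        char c = *str;        /* memo: the last character that was kept */
        char *w = str + 1;    /* writer starts at the second slot */

        for (char *r = str + 1; *r; ++r) {   /* reader advances once per iteration */
            if (*r != c) {
                c = *r;       /* new run: remember it ... */
                *w++ = c;     /* ... and keep its first character */
            }
        }
        *w = '\0';            /* terminate at the writer position */
    }
}
```

The writer never overtakes the reader, so no character is overwritten before it has been read, and no allocation is needed.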
You can modify the string passed as a parameter if it is modifiable.
Example:
void deleteFirstChar(char *str)
{
if(str && *str)
{
memmove(str, str + 1, strlen(str));
}
}
//illegal call
//string literals cannot be changed
void foo(void)
{
char *str = "Hello";
deleteFirstChar(str);
}
//legal call
void bar(void)
{
char str[] = "Hello";
deleteFirstChar(str);
}
This isn't an answer, but consider this code:
char str[] = "strunqg";
printf("before: %s\n", str);
modifystring(str);
printf("after: %s\n", str);
where the "modifystring" function looks like this:
void modifystring(char *p)
{
p[3] = 'i';
p[5] = p[6];
p[6] = '\0';
}
This is totally "legit". It would not work, however, to call
char *str = "strunqg";
modifystring(str); /* WRONG */
or
modifystring("strunqg"); /* WRONG */
Either of these second two would attempt to modify a string literal, and that's not copacetic.
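If the text really does start out as a string literal, one way around this is to modify a copy instead. The helper below is a hypothetical sketch (modifiable_copy is not a standard function); it assumes the caller frees the result.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of s that may be modified.
   The caller must free() it. Returns NULL if allocation fails. */
char *modifiable_copy(const char *s)
{
    size_t n = strlen(s) + 1;   /* include the terminating NUL */
    char *copy = malloc(n);
    if (copy)
        memcpy(copy, s, n);
    return copy;
}
```

With this, something like char *str = modifiable_copy("strunqg"); followed by a NULL check, modifystring(str); and free(str); stays within the rules, because the literal itself is only read.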
void delete_duplicate(char* str)
{
if(str && *str)
{
char *writeTo = str, *readFrom = str;
char prev = 0;
while(*readFrom)
{
if(prev != *readFrom)
{
prev = *readFrom++;
*writeTo++ = prev;
}
else
{
readFrom++;
}
}
*writeTo = 0;
}
}
int main(void)
{
char str[] = "hhhhhhhheeeeeeeeyyyyyyyyyyy";
delete_duplicate(str);
printf("`%s`\n", str);
return 0;
}

Inserting strings into another string in C

I'm implementing a function which, given a string, a character and another string (from now on, the "substring"), puts the substring everywhere the character appears in the string.
To explain it better, given these parameters this is what the function should return (pseudocode):
func ("aeiou", 'i', "hello") -> aehelloou
I'm using some functions from the string.h library. I have tested it with pretty good results:
char *somestring= "this$ is a tes$t wawawa$wa";
printf("%s", strcinsert(somestring, '$', "WHAT?!") );
Outputs: thisWHAT?! is a tesWHAT?!t wawawaWHAT?!wa
So far everything is all right. The problem is when I try to do the same with, for example, this string:
char *somestring= "this \"is a test\" wawawawa";
printf("%s", strcinsert(somestring, '"', "\\\"") );
since I want to change every " to \" . When I do this, the PC collapses. I don't know why, but it stops working and then shuts down. I've heard something about the bad behavior of some functions of the string.h library, but I couldn't find any information about this. I'd really appreciate any help.
My code:
#define salloc(size) (str)malloc(size+1) //i'm lazy
typedef char* str;
str strcinsert (str string, char flag, str substring)
{
int nflag= 0; //this is the number of times the character appears
for (int i= 0; i<strlen(string); i++)
if (string[i]==flag)
nflag++;
str new=string;
int pos;
while (strchr(string, flag)) //since when its not found returns NULL
{
new= salloc(strlen(string)+nflag*strlen(substring)-nflag);
pos= strlen(string)-strlen(strchr(string, flag));
strncpy(new, string, pos);
strcat(new, substring);
strcat(new, string+pos+1);
string= new;
}
return new;
}
Thanks for any help!
Some advice:
refrain from typedef char* str;. The char * type is common in C and masking it will just make your code harder to review
refrain from #define salloc(size) (str)malloc(size+1) for the exact same reason. In addition, don't cast malloc in C
each time you write a malloc (or calloc or realloc) there should be a corresponding free: C has no garbage collection
dynamic allocation is expensive, use it only when needed. Said differently, a malloc inside a loop should be looked at twice (especially if there is no corresponding free)
always test the result of allocation functions (unrelated: and of I/O functions): malloc will simply return NULL when you exhaust memory. A nice error message is then easier to understand than a crash
learn to use a debugger: if you had executed your code under a debugger the error would have been evident
Next, the cause: if the replacement string contains the flag character, the search finds it again and you run in an endless loop
A possible workaround: allocate the result string before the loop and advance through both the original string and the result. It will save you from unnecessary allocations and de-allocations, and will be immune to the flag character being present in the replacement string.
Possible code:
// the result is an allocated string that must be freed by caller
str strcinsert(str string, char flag, str substring)
{
    int nflag = 0; //this is the number of times the character appears
    for (int i = 0; i<strlen(string); i++)
        if (string[i] == flag)
            nflag++;
    str new_ = salloc(strlen(string) + nflag*strlen(substring) - nflag);
    if (new_ == NULL) // allocation failed
        return NULL;
    if (nflag == 0) // the loop below never runs, so copy the input as is
        strcpy(new_, string);
    char *cur = new_;
    char *old = string;
    int pos;
    while (NULL != (string = strchr(string, flag))) //since when its not found returns NULL
    {
        pos = string - old;
        strncpy(cur, old, pos);
        cur[pos] = '\0'; // strncpy does not null terminate the dest. string
        strcat(cur, substring);
        strcat(cur, string + 1);
        cur += strlen(substring) + pos; // advance the result
        old = ++string; // and the input string
    }
    return new_;
}
Note: I have not reverted the str and salloc but you really should do.
In your second loop, you always look for the first flag character in the string. In this case, that’ll be the one you just inserted from substring. The strchr function will always find that quote and never return NULL, so your loop will never terminate and just keep allocating memory (and not enough of it, since your string grows arbitrarily large).
Speaking of allocating memory, you need to be more careful with that. Unlike in Python, C doesn’t automatically notice when you’re no longer using memory; anything you malloc must be freed. You also allocate far more memory than you need: even in your working "this$ is a tes$t wawawa$wa" example, you allocate enough space for the full string on each iteration of the loop, and never free any of it. You should just run the allocation once, before the second loop.
This isn’t as important as the above stuff, but you should also pay attention to performance. Each call to strcat and strlen iterates over the entire string, meaning you look at it far more often than you need. You should instead save the result of strlen, and copy the new string directly to where you know the NUL terminator is. The same goes for strchr; you already replaced the beginning of the string and don’t want to waste time looking at it again, apart from the part where that’s causing your current bug.
In comparison to these issues, the style issues mentioned in the comments with your typedef and macro are relatively minor, but they are still worth mentioning. A char* in C is different from a str in Python; trying to typedef it to the same name just makes it more likely you’ll try to treat them as the same and run into these issues.
I don't know why but it stops working
strchr(string, flag) is looking over the whole string for flag. The search needs to be limited to the portion of the string not yet examined/updated. By re-searching the partially replaced string, the code finds the flag over and over.
The whole string management approach needs re-work. As OP reported a Python background, I've posted a very C approach as mimicking Python is not a good approach here. C is different especially in the management of memory.
Untested code
// Look for needles in a haystack and replace them
// Note that replacement may be "" and result in a shorter string than haystack
char *strcinsert_alloc(const char *haystack, char needle, const char *replacement) {
    size_t n = 0;
    const char *s = haystack;
    while (*s) {
        if (*s == needle) n++; // Find needle count
        s++;
    }
    size_t replacement_len = strlen(replacement);
    // string length - needles + replacements + \0
    size_t new_size = (size_t)(s - haystack) - n*1 + n*replacement_len + 1;
    char *dest = malloc(new_size);
    if (dest) {
        char *d = dest;
        s = haystack;
        while (*s) {
            if (*s == needle) {
                memcpy(d, replacement, replacement_len); // copy the replacement, not the haystack
                d += replacement_len;
            } else {
                *d = *s;
                d++;
            }
            s++;
        }
        *d = '\0';
    }
    return dest;
}
In your program, you are facing a problem with this input:
char *somestring= "this \"is a test\" wawawawa";
since you want to replace " with \".
The first problem is that whenever you replace " with \" in the string, in the next iteration strchr(string, flag) will find the " of the \" you just inserted. So, in subsequent iterations your string will look like this:
this \"is a test" wawawawa
this \\"is a test" wawawawa
this \\\"is a test" wawawawa
So, for the input string "this \"is a test\" wawawawa" your while loop will run forever, as every time strchr(string, flag) finds the " of the last inserted \".
The second problem is the memory allocation you are doing in every iteration of your while loop. There is no free() for the memory allocated to new. So when the while loop runs forever, it eats up all the memory, which leads to "the PC collapses".
To resolve this, in every iteration you should search for flag only in the part of the string that starts right after the last inserted substring. Also, make sure to free() the dynamically allocated memory.
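A minimal sketch of that fix, assuming a single allocation up front and a search that always resumes in the input, just past the last hit. The name insert_at_char is hypothetical, and rejecting a NUL flag is an assumption made for this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of src in which every occurrence of flag
   is replaced by sub. The caller must free() the result.
   Returns NULL on allocation failure or if flag is '\0' (sketch assumption). */
char *insert_at_char(const char *src, char flag, const char *sub)
{
    if (flag == '\0')
        return NULL;

    size_t sub_len = strlen(sub);
    size_t count = 0;

    /* Count the flags; each search resumes after the previous hit. */
    for (const char *p = strchr(src, flag); p; p = strchr(p + 1, flag))
        count++;

    /* Input length, minus the flags, plus the inserted text, plus the NUL. */
    char *out = malloc(strlen(src) - count + count * sub_len + 1);
    if (!out)
        return NULL;

    char *w = out;
    const char *start = src;
    const char *hit;
    while ((hit = strchr(start, flag)) != NULL) {
        size_t n = (size_t)(hit - start);
        memcpy(w, start, n);        /* text before the flag */
        w += n;
        memcpy(w, sub, sub_len);    /* the inserted substring */
        w += sub_len;
        start = hit + 1;            /* resume in the input, after the flag */
    }
    strcpy(w, start);               /* remaining tail, including the NUL */
    return out;
}
```

Because the search only ever looks at src, it does not matter whether sub itself contains the flag character.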

How to determine the length of a string (without using strlen())

size_t stringlength(const char *s)
Using this function, how could find the length of a string? I am not referring to using strlen(), but creating it. Any help is greatly appreciated.
Cycle/iterate through the string, keeping a count. When you hit \0, you have reached the end of your string.
The basic concepts involved are a loop, a conditional (to test for the end of string), maintaining a counter and accessing elements in a sequence of characters.
Note: There are more idiomatic/clever solutions. However, OP is clearly new to C and programming (no offense intended, we all started out as newbies), so inflicting pointer arithmetic on them, as one of the solutions did, or writing perhaps overly terse/compact solutions is less about OP's needs and more about a demonstration of the posters' programming skills :) Intentionally providing suggestions for a simple-to-understand solution earned me at least one downvote (yes, this for "imaginary code" that I didn't even provide. I didn't want to ready-serve a code solution, but let OP figure it out with some guidance).
Main Point: I think answers should always be adjusted to the level of the questioner.
size_t stringlength(const char *s) {
size_t count = 0;
while (*(s++) != '\0') count++;
return count;
}
The confusing part could be the expression *(s++). The postfix ++ operator advances the pointer to the next character in the buffer but yields the pointer's old value, so the dereferencing operator * reads the character at the position the pointer had before the increment. A more legible approach would be:
size_t stringlength(const char *s) {
size_t count = 0;
while (s[count] != '\0') count++;
return count;
}
Another couple of reference versions (but less legible) are:
size_t stringlength(const char *s) {
size_t count = 0;
while (*s++) count++;
return count;
}
size_t stringlength(const char *s) {
const char* c = s;
for (; *c; c++);
return c - s;
}
The code stated here is just a reference to give you ideas of how to implement the algorithm described in the above answer; there exist more efficient ways of meeting the same requirement (check the glibc implementation for example, which checks a whole machine word, e.g. 4 bytes, at a time).
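This might not be directly relevant code, but I think it is worth knowing.
Since it saves time...
int a[] = {1,2,3,4,5,6};
char *first = (char *)&a; // address of the first byte of the array, say 100
char *past = (char *)(&a + 1); // address one past the end of the array, say 124
size_t size = (size_t)(past - first) / sizeof(int); // (past - first) would be 24 and (24/4) would be 6
// assuming int is 4 bytes
printf("Size of int array a is: %zu\n", size);
And for strings:
char a[] = "Hello";
char *first = (char *)&a; // address of the first byte of the array, say 100
char *past = (char *)(&a + 1); // address one past the end of the array, say 106
printf("size of string a is: %td\n", (past - first) - 1); // (past - first) would be 6
If you are confused about why &a+1 gives the address just past the end of the array: &a has the type "pointer to the whole array", so adding 1 to it advances by the size of the entire array. Note that this only works on an actual array; inside stringlength(const char *s) there is only a pointer, so the string still has to be scanned.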
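A more conventional way to get the same numbers is the sizeof operator. The sketch below is an illustration added here, with hypothetical names (ARRAY_LEN, int_array_len_demo, string_array_len_demo); like the &a+1 trick, it works only on real arrays, not on a pointer such as the const char *s parameter of stringlength.

```c
#include <stddef.h>

/* Number of elements in an array. Only valid when 'a' is an actual array,
   not a pointer (for a pointer it silently gives a wrong answer). */
#define ARRAY_LEN(a) (sizeof (a) / sizeof (a)[0])

/* Element count of a local int array. */
size_t int_array_len_demo(void)
{
    int a[] = {1, 2, 3, 4, 5, 6};
    return ARRAY_LEN(a);
}

/* Length of a string stored in a local char array: the array size
   includes the terminating NUL, so subtract one. */
size_t string_array_len_demo(void)
{
    char a[] = "Hello";
    return sizeof a - 1;
}
```

Both values are computed at compile time, which is why no loop over the data is needed.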
Assuming s is a non-null pointer, the following function traverses s from its beginning until the terminating zero is found. For each character passed (s++), count is incremented (count++).
size_t stringlength(const char *s) {
size_t count = 0;
while (*s) {
s++;
count++;
}
return count;
}

please help making strstr()

I have made a strstr() function, but the program does not give any output, just a blank screen. Please have a look at the code.
#include<stdio.h>
#include<conio.h>
const char* mystrstr(const char *str1, const char *str2);
int main()
{
const char *str1="chal bhai nikal";
const char *str2="nikal",*result;
result=mystrstr(str1,str2);
printf("found at %d location",(int*)*result);
getch();
return 0;
}
const char * mystrstr(const char *s1, const char *s2)
{
int i,j,k,len2,count=0;
char *p;
for(len2=0;*s2!='\0';len2++);//len2 becomes the length of s2
for(i=0,count=0;*s1!='\0';i++)
{
if(*(s1+i)==*s2)
{
for(j=i,k=0;*s2!='\0';j++,k++)
{
if(*(s1+j)==*(s2+i))
count++;
if(count==len2)
{
p=(char*)malloc(sizeof(char*));
*p='i';
return p;
}
}
}
}
return NULL;
}
The line with this comment:
//len2 becomes the length of s2
is broken. You repeatedly check the first character of s2. Instead of *s2, try s2[len2].
Edit: as others have said, there are apparently a lot more things wrong with this implementation. If you want the naive, brute-force strstr algorithm, here's a concise and fast version of it:
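char *naive_strstr(const char *h, const char *n)
{
    for (;; h++) {
        size_t i = 0;
        while (n[i] && h[i] == n[i]) i++; /* match as far as possible starting at h */
        if (!n[i]) return (char *)h; /* reached the end of n: found */
        if (!h[i]) return 0; /* h ran out first: no match is possible */
    }
}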
It looks like this is an exercise you're doing to learn more about algorithms and C strings and pointers, so I won't solve those issues for you, but here are some starting points:
You have an infinite loop when calculating len2 (your loop condition is *s2 but you're never changing s2)
You have a similar issue with the second for loop, although you have an early return so it might not be infinite, but I doubt the condition is correct
Given you want to behave like strstr(), you should return a pointer to the first string, not a new pointer you allocate. There is no reason for you to allocate during a function like strstr.
In main(), if you want to calculate the position of the found string, you want to print result-str1 (unless result is NULL). (int*)*result makes no sense - result should be a pointer to the string (or NULL)
You also need to change this line:
if(*(s1+j)==*(s2+i))
to this:
if(*(s1+j)==*(s2+k))
As already mentioned, the return value is a bit odd. You are returning a char* but kind of trying to put an integer value in it. The result doesn't make sense logically. You should either return a pointer to the location where it is found (no malloc needed) or return the integer position (i). But returning the integer position is not the "typical" strstr implementation.
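The two return conventions are easy to convert between. As a small sketch (find_offset is a hypothetical name, and it leans on the standard strstr rather than the OP's function), the integer position is just the pointer difference:

```c
#include <stddef.h>
#include <string.h>

/* Returns the zero-based offset of the first occurrence of needle in
   haystack, or -1 if needle does not occur. */
ptrdiff_t find_offset(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);   /* pointer into haystack, or NULL */
    return hit ? hit - haystack : -1;
}
```

A caller would print the result with the %td conversion, after checking for -1.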

The intricacy of a string tokenization function in C

For brushing up my C, I'm writing some useful library code. When it came to reading text files, it's always useful to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous).
When I wrote this function, I was amazed at its intricacy. To tell the truth, I'm almost convinced that it contains bugs (especially with memory leaks in case of an allocation error). Here's the code:
/* Given an input string and separators, returns an array of
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
**
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/
char** tokenize(const char* input, const char* sep)
{
/* strtok ruins its input string, so we'll work on a copy
*/
char* dup;
/* This is the array filled with tokens and returned
*/
char** toks = 0;
/* Current token
*/
char* cur_tok;
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 2;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
size_t i;
if (!(dup = strdup(input)))
return NULL;
if (!(toks = malloc(size * sizeof(*toks))))
goto cleanup_exit;
cur_tok = strtok(dup, sep);
/* While we have more tokens to process...
*/
while (cur_tok)
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (!newtoks)
goto cleanup_exit;
toks = newtoks;
}
/* Now the array is definitely large enough, so we just
** copy the new token into it.
*/
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
ntok++;
cur_tok = strtok(0, sep);
}
free(dup);
toks[ntok] = 0;
return toks;
cleanup_exit:
free(dup);
for (i = 0; i < ntok; ++i)
free(toks[i]);
free(toks);
return NULL;
}
And here's simple usage:
int main()
{
char line[] = "The quick brown fox jumps over the lazy dog";
char** toks = tokenize(line, " \t");
int i;
for (i = 0; toks[i]; ++i)
printf("%s\n", toks[i]);
/* Deallocate
*/
for (i = 0; toks[i]; ++i)
free(toks[i]);
free(toks);
return 0;
}
Oh, and strdup:
/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
size_t len = strlen(str) + 1;
char* dup = malloc(len);
if (dup)
memcpy(dup, str, len);
return dup;
}
A few things to note about the code of the tokenize function:
strtok has the impolite habit of writing over its input string. To save the user's data, I only call it on a duplicate of the input. The duplicate is obtained using strdup.
strdup isn't ANSI C, however, so I had to write one.
The toks array is grown dynamically with realloc, since we have no idea in advance how many tokens there will be. The initial size is 2 just for testing, in real-life code I would probably set it to a much higher value. It's also returned to the user, and the user has to deallocate it after use.
In all cases, extreme care is taken not to leak resources. For example, if realloc returns NULL, it won't run over the old pointer. The old pointer will be released and the function returns. No resources leak when tokenize returns (except in the nominal case where the array returned to the user must be deallocated after use).
A goto is used for more convenient cleanup code, according to the philosophy that goto can be good in some cases (this is a good example, IMHO).
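The realloc point in the notes above (never overwrite the old pointer until the call has succeeded) can be sketched in isolation. This is only an illustration: grow_ptr_array is a hypothetical helper name, not part of the original tokenize code, and the doubling policy is an assumption carried over from the comments in tokenize.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: doubles the capacity of a char* array.
** On success, updates *arr and *cap and returns 1.
** On failure, leaves *arr and *cap untouched and returns 0,
** so the caller still owns (and can free) the old block.
*/
int grow_ptr_array(char ***arr, size_t *cap)
{
    char **grown;
    size_t newcap;

    assert(arr != NULL && cap != NULL);

    newcap = *cap ? *cap * 2 : 2;
    /* Assign to a temporary first: if realloc fails, *arr is still valid. */
    grown = realloc(*arr, newcap * sizeof(*grown));
    if (!grown)
        return 0;
    *arr = grown;
    *cap = newcap;
    return 1;
}
```

Writing the result of realloc straight back into the array pointer would lose the only reference to the old block on failure, which is exactly the leak the notes describe avoiding.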
The following function can help with simple deallocation in a single call:
/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
if (toks && *toks)
{
int i;
for (i = 0; (*toks)[i]; ++i)
free((*toks)[i]);
free(*toks);
*toks = 0;
}
}
I'd really like to discuss this code with other users of SO. What could've been done better? Would you recommend a different interface to such a tokenizer? How is the burden of deallocation taken from the user? Are there memory leaks in the code anyway?
Thanks in advance
One thing I would recommend is to provide tokenize_free that handles all the deallocations. It's easier on the user and gives you the flexibility to change your allocation strategy in the future without breaking users of your library.
The code below fails when the first character of the string is a separator:
One additional idea is not to bother duplicating each individual token. I don't see what it adds, and it just gives you more places where the code can fail. Instead, just keep the duplicate of the full buffer you made. What I mean is change:
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
to:
toks[ntok] = cur_tok;
Drop the line free(dup) from the non-error path. Finally, this changes cleanup to:
free(toks[0]);
free(toks);
You don't need to strdup() each token; you duplicate the input string, and could let strtok() chop that up. It simplifies releasing the resources afterwards, too - you only have to release the array of pointers and the single string.
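A minimal sketch of that single-copy idea follows, under some assumptions. The names tokenize_shared and tokenize_shared_free are hypothetical, and the hidden slot in front of the returned array is my own addition: strtok skips leading separators, so the first token does not necessarily point at the start of the copied buffer, and the cleanup code needs some other way to find it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of tokenize(): one copy of the input, no per-token
** strdup. The copy's address is stashed in a hidden slot just before the
** array handed back to the caller, so cleanup does not depend on toks[0].
*/
char **tokenize_shared(const char *input, const char *sep)
{
    size_t size = 4;   /* slots allocated, including the hidden one */
    size_t used = 1;   /* slot 0 is reserved for the buffer pointer */
    size_t len;
    char *buf;
    char *tok;
    char **slots;

    assert(input != NULL && sep != NULL);

    len = strlen(input) + 1;
    buf = malloc(len);
    if (!buf)
        return NULL;
    memcpy(buf, input, len);

    slots = malloc(size * sizeof(*slots));
    if (!slots) {
        free(buf);
        return NULL;
    }
    slots[0] = buf;

    for (tok = strtok(buf, sep); tok; tok = strtok(NULL, sep)) {
        /* Keep room for this token and the NULL sentinel. */
        if (used + 2 > size) {
            char **grown = realloc(slots, 2 * size * sizeof(*slots));
            if (!grown) {
                free(buf);
                free(slots);
                return NULL;
            }
            slots = grown;
            size *= 2;
        }
        slots[used++] = tok;
    }
    slots[used] = NULL;
    return slots + 1;   /* caller sees only the token pointers */
}

/* Releases everything allocated by tokenize_shared(). */
void tokenize_shared_free(char **toks)
{
    if (toks) {
        free(toks[-1]);   /* the duplicated input string */
        free(toks - 1);   /* the pointer array itself */
    }
}
```

The caller still iterates with for (i = 0; toks[i]; ++i) as before, but releases everything with a single tokenize_shared_free(toks) call.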
I agree with those who say that you need a function to release the data - unless you change the interface radically and have the user provide the array of pointers as an input parameter, and then you would probably also decide that the user is responsible for duplicating the string if it must be preserved. That leads to an interface:
int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens);
The return value would be the number of tokens found.
You have to decide what to do when there are more tokens than slots in the array. Options include:
returning an error indication (negative number, likely -1), or
the full number of tokens found but the pointers that can't be assigned aren't, or
just the number of tokens that fitted, or
one more than the number of tokens, indicating that there were more, but no information on exactly how many more.
I chose to return '-1', and it led to this code:
/*
#(#)File: $RCSfile: tokenise.c,v $
#(#)Version: $Revision: 1.9 $
#(#)Last changed: $Date: 2008/02/11 08:44:50 $
#(#)Purpose: Tokenise a string
#(#)Author: J Leffler
#(#)Copyright: (C) JLSS 1987,1989,1991,1997-98,2005,2008
#(#)Product: :PRODUCT:
*/
/*TABSTOP=4*/
/*
** 1. A token is 0 or more characters followed by a terminator or separator.
** The terminator is ASCII NUL '\0'. The separators are user-defined.
** 2. A leading separator is preceded by a zero-length token.
** A trailing separator is followed by a zero-length token.
** 3. The number of tokens found is returned.
** The list of token pointers is terminated by a NULL pointer.
** 4. The routine returns 0 if the arguments are invalid.
** It returns -1 if too many tokens were found.
*/
#include "jlss.h"
#include <string.h>
#define NO 0
#define YES 1
#define IS_SEPARATOR(c,s,n) (((c) == *(s)) || ((n) > 1 && strchr((s),(c))))
#define DIM(x) (sizeof(x)/sizeof(*(x)))
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_tokenise_c[] = "#(#)$Id: tokenise.c,v 1.9 2008/02/11 08:44:50 jleffler Exp $";
#endif /* lint */
int tokenise(
char *str, /* InOut: String to be tokenised */
char *sep, /* In: Token separators */
char **token, /* Out: Pointers to tokens */
int maxtok, /* In: Maximum number of tokens */
int nulls) /* In: Are multiple separators OK? */
{
int c;
int n_tokens;
int tokenfound;
int n_sep = strlen(sep);
if (n_sep <= 0 || maxtok <= 2)
return(0);
n_tokens = 1;
*token++ = str;
while ((c = *str++) != '\0')
{
tokenfound = NO;
while (c != '\0' && IS_SEPARATOR(c, sep, n_sep))
{
tokenfound = YES;
*(str - 1) = '\0';
if (nulls)
break;
c = *str++;
}
if (tokenfound)
{
if (++n_tokens >= maxtok - 1)
return(-1);
if (nulls)
*token++ = str;
else
*token++ = str - 1;
}
if (c == '\0')
break;
}
*token++ = 0;
return(n_tokens);
}
#ifdef TEST
struct
{
char *sep;
int nulls;
} data[] =
{
{ "/.", 0 },
{ "/.", 1 },
{ "/", 0 },
{ "/", 1 },
{ ".", 0 },
{ ".", 1 },
{ "", 0 }
};
static char string[] = "/fred//bill.c/joe.b/";
int main(void)
{
int i;
int j;
int n;
char input[100];
char *token[20];
for (i = 0; i < DIM(data); i++)
{
strcpy(input, string);
printf("\n\nTokenising <<%s>> using <<%s>>, null %d\n",
input, data[i].sep, data[i].nulls);
n = tokenise(input, data[i].sep, token, DIM(token),
data[i].nulls);
printf("Return value = %d\n", n);
for (j = 0; j < n; j++)
printf("Token %d: <<%s>>\n", j, token[j]);
if (n > 0)
printf("Token %d: 0x%08lX\n", n, (unsigned long)token[n]);
}
return(0);
}
#endif /* TEST */
I don't see anything wrong with the strtok approach of modifying a string in place - it's the caller's choice whether to operate on a duplicated string or not, as the semantics are well understood. Below is the same method slightly simplified to use strtok as intended, yet still return a handy array of char * pointers (which now simply point to the tokenized segments of the original string). It gives the same output for your original main() call.
The main advantage of this approach is that you only have to free the returned character array, instead of looping through to clear all of the elements - an aspect which I thought took away a lot of the simplicity factor and something a caller would be very unlikely to expect to do by any normal C convention.
I also took out the goto statements, because with the code refactored they just didn't make much sense to me. I think the danger of having a single cleanup point is that it can start to grow too unwieldy and do extra steps that are not needed to clean up issues at specific locations.
Personally I think the main philosophical point I would make is that you should respect what other people using the language are going to expect, especially when creating library-style calls. Even if the strtok replacement behavior seems odd to you, the vast majority of C programmers are used to placing \0 in the middle of C strings to split them up or create shorter strings, so this will seem quite natural. Also, as noted, no one is going to expect to do anything beyond a single free() with the return value from a function. Write your code so that it really does work that way, because people will simply not read any documentation you might offer and will instead act according to the memory convention of your return value (which is char **, so a caller would expect to have to free that).
char** tokenize(char* input, const char* sep)
{
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 4;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
/* This is the array filled with tokens and returned
*/
char** toks = malloc(size * sizeof(*toks));
if ( toks == NULL )
return NULL;
toks[ntok] = strtok( input, sep );
/* While we have more tokens to process...
*/
do
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (newtoks == NULL)
{
free(toks);
return NULL;
}
toks = newtoks;
}
ntok++;
toks[ntok] = strtok(0, sep);
} while (toks[ntok]);
return toks;
}
Just a few things:
Using gotos is not intrinsically evil or bad; much like the preprocessor, they are often abused. In cases like yours, where you have to exit a function differently depending on how things went, they are appropriate.
Provide a functional means of freeing the returned array. I.e. tok_free(pointer).
Use the re-entrant version of strtok() initially, i.e. strtok_r(). It would not be cumbersome for someone to pass an additional argument (even NULL if not needed) for that.
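To illustrate the strtok_r suggestion, here is a small sketch, assuming a POSIX system (strtok_r is not ISO C). The helper name split_r and its caller-supplied output array are hypothetical and not taken from any of the answers; the point is only that the tokenizer state lives in a local variable instead of in hidden static storage.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* strtok_r is POSIX, not ISO C */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits 'input' in place and stores up to 'max'
** token pointers in 'out'. Returns the number of tokens stored.
** All strtok_r state lives in 'save', so tokenizing another string
** in between calls does not interfere with this one.
*/
size_t split_r(char *input, const char *sep, char **out, size_t max)
{
    char *save = NULL;
    char *tok;
    size_t n = 0;

    assert(input != NULL && sep != NULL && out != NULL);

    for (tok = strtok_r(input, sep, &save);
         tok != NULL && n < max;
         tok = strtok_r(NULL, sep, &save))
        out[n++] = tok;
    return n;
}
```

Because each call keeps its own save pointer, a caller can split a line on ";" and then split one of the resulting fields on "=" without corrupting the outer pass, which plain strtok cannot do safely.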
There is a great tool for detecting memory leaks called Valgrind.
http://valgrind.org/
If you want to find memory leaks, one possibility is to run it with valgrind.

Resources