The intricacy of a string tokenization function in C

To brush up on my C, I'm writing some useful library code. When reading text files, it's always useful to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous).
After writing this function, I was amazed at its intricacy. To tell the truth, I'm almost convinced that it contains bugs (especially memory leaks in case of an allocation error). Here's the code:
/* Given an input string and separators, returns an array of
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
**
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/
char** tokenize(const char* input, const char* sep)
{
/* strtok ruins its input string, so we'll work on a copy
*/
char* dup;
/* This is the array filled with tokens and returned
*/
char** toks = 0;
/* Current token
*/
char* cur_tok;
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 2;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
size_t i;
if (!(dup = strdup(input)))
return NULL;
if (!(toks = malloc(size * sizeof(*toks))))
goto cleanup_exit;
cur_tok = strtok(dup, sep);
/* While we have more tokens to process...
*/
while (cur_tok)
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (!newtoks)
goto cleanup_exit;
toks = newtoks;
}
/* Now the array is definitely large enough, so we just
** copy the new token into it.
*/
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
ntok++;
cur_tok = strtok(0, sep);
}
free(dup);
toks[ntok] = 0;
return toks;
cleanup_exit:
free(dup);
for (i = 0; i < ntok; ++i)
free(toks[i]);
free(toks);
return NULL;
}
And here's simple usage:
int main()
{
char line[] = "The quick brown fox jumps over the lazy dog";
char** toks = tokenize(line, " \t");
int i;
/* tokenize returns NULL on allocation failure
*/
if (!toks)
return 1;
for (i = 0; toks[i]; ++i)
printf("%s\n", toks[i]);
/* Deallocate
*/
for (i = 0; toks[i]; ++i)
free(toks[i]);
free(toks);
return 0;
}
Oh, and strdup:
/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
size_t len = strlen(str) + 1;
char* dup = malloc(len);
if (dup)
memcpy(dup, str, len);
return dup;
}
A few things to note about the code of the tokenize function:
strtok has the impolite habit of writing over its input string. To save the user's data, I only call it on a duplicate of the input. The duplicate is obtained using strdup.
strdup isn't ANSI C, however, so I had to write one.
The toks array is grown dynamically with realloc, since we have no idea in advance how many tokens there will be. The initial size is 2 just for testing, in real-life code I would probably set it to a much higher value. It's also returned to the user, and the user has to deallocate it after use.
In all cases, extreme care is taken not to leak resources. For example, if realloc returns NULL, it won't run over the old pointer. The old pointer will be released and the function returns. No resources leak when tokenize returns (except in the nominal case where the array returned to the user must be deallocated after use).
A goto is used for more convenient cleanup code, according to the philosophy that goto can be good in some cases (this is a good example, IMHO).
The following function can help with simple deallocation in a single call:
/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
if (toks && *toks)
{
int i;
for (i = 0; (*toks)[i]; ++i)
free((*toks)[i]);
free(*toks);
*toks = 0;
}
}
I'd really like to discuss this code with other users of SO. What could have been done better? Would you recommend a different interface for such a tokenizer? How can the burden of deallocation be taken off the user? Are there memory leaks in the code anyway?
Thanks in advance

One thing I would recommend is to provide tokenize_free that handles all the deallocations. It's easier on the user and gives you the flexibility to change your allocation strategy in the future without breaking users of your library.
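One additional idea is not to bother duplicating each individual token. I don't see what it adds, and it just gives you more places where the code can fail. Instead, just keep the duplicate of the full buffer you made. Note that the simplified code below fails when the first character of the string is a separator: strtok skips leading separators, so toks[0] no longer points at the start of the duplicated buffer and free(toks[0]) receives a pointer that malloc never returned. What I mean is change: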
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
to:
toks[ntok] = cur_tok;
Drop the free(dup) line from the non-error path. Finally, this changes cleanup to:
free(toks[0]);
free(toks);
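One way to keep the single-buffer idea while staying safe with leading separators is to remember the buffer pointer separately instead of relying on toks[0]. The following is only a sketch under that assumption; the struct token_list type and the token_list_init / token_list_free names are hypothetical and not part of the original code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: 'buf' owns the copy of the input,
** 'toks' holds pointers into that copy. */
struct token_list {
    char  *buf;    /* single duplicate of the input, chopped up by strtok */
    char **toks;   /* NULL-terminated array of pointers into buf */
    size_t count;  /* number of tokens, not counting the sentinel */
};

/* Returns 0 on success, -1 on allocation failure (nothing is leaked). */
int token_list_init(struct token_list *tl, const char *input, const char *sep)
{
    size_t cap = 4;
    size_t len = strlen(input) + 1;

    tl->count = 0;
    tl->toks = NULL;
    tl->buf = malloc(len);
    if (!tl->buf)
        return -1;
    memcpy(tl->buf, input, len);

    tl->toks = malloc(cap * sizeof *tl->toks);
    if (!tl->toks)
        goto fail;

    for (char *tok = strtok(tl->buf, sep); tok; tok = strtok(NULL, sep)) {
        if (tl->count + 2 > cap) {   /* need room for this token and the sentinel */
            char **grown = realloc(tl->toks, 2 * cap * sizeof *grown);
            if (!grown)
                goto fail;
            tl->toks = grown;
            cap *= 2;
        }
        tl->toks[tl->count++] = tok;
    }
    tl->toks[tl->count] = NULL;
    return 0;

fail:
    free(tl->toks);
    free(tl->buf);
    tl->toks = NULL;
    tl->buf = NULL;
    tl->count = 0;
    return -1;
}

/* Releases both allocations; safe to call on an already freed list. */
void token_list_free(struct token_list *tl)
{
    free(tl->toks);
    free(tl->buf);
    tl->toks = NULL;
    tl->buf = NULL;
    tl->count = 0;
}
```

Because the buffer is freed through tl->buf, it no longer matters whether the first token starts at the beginning of the copy, and an input with no tokens at all does not leak the copy either.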

You don't need to strdup() each token; you duplicate the input string, and could let strtok() chop that up. It simplifies releasing the resources afterwards, too - you only have to release the array of pointers and the single string.
I agree with those who say that you need a function to release the data - unless you change the interface radically and have the user provide the array of pointers as an input parameter, and then you would probably also decide that the user is responsible for duplicating the string if it must be preserved. That leads to an interface:
int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens);
The return value would be the number of tokens found.
You have to decide what to do when there are more tokens than slots in the array. Options include:
returning an error indication (negative number, likely -1), or
the full number of tokens found but the pointers that can't be assigned aren't, or
just the number of tokens that fitted, or
one more than the number of tokens, indicating that there were more, but no information on exactly how many more.
I chose to return '-1', and it led to this code:
/*
#(#)File: $RCSfile: tokenise.c,v $
#(#)Version: $Revision: 1.9 $
#(#)Last changed: $Date: 2008/02/11 08:44:50 $
#(#)Purpose: Tokenise a string
#(#)Author: J Leffler
#(#)Copyright: (C) JLSS 1987,1989,1991,1997-98,2005,2008
#(#)Product: :PRODUCT:
*/
/*TABSTOP=4*/
/*
** 1. A token is 0 or more characters followed by a terminator or separator.
** The terminator is ASCII NUL '\0'. The separators are user-defined.
** 2. A leading separator is preceded by a zero-length token.
** A trailing separator is followed by a zero-length token.
** 3. The number of tokens found is returned.
** The list of token pointers is terminated by a NULL pointer.
** 4. The routine returns 0 if the arguments are invalid.
** It returns -1 if too many tokens were found.
*/
#include "jlss.h"
#include <string.h>
#define NO 0
#define YES 1
#define IS_SEPARATOR(c,s,n) (((c) == *(s)) || ((n) > 1 && strchr((s),(c))))
#define DIM(x) (sizeof(x)/sizeof(*(x)))
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_tokenise_c[] = "#(#)$Id: tokenise.c,v 1.9 2008/02/11 08:44:50 jleffler Exp $";
#endif /* lint */
int tokenise(
char *str, /* InOut: String to be tokenised */
char *sep, /* In: Token separators */
char **token, /* Out: Pointers to tokens */
int maxtok, /* In: Maximum number of tokens */
int nulls) /* In: Are multiple separators OK? */
{
int c;
int n_tokens;
int tokenfound;
int n_sep = strlen(sep);
if (n_sep <= 0 || maxtok <= 2)
return(0);
n_tokens = 1;
*token++ = str;
while ((c = *str++) != '\0')
{
tokenfound = NO;
while (c != '\0' && IS_SEPARATOR(c, sep, n_sep))
{
tokenfound = YES;
*(str - 1) = '\0';
if (nulls)
break;
c = *str++;
}
if (tokenfound)
{
if (++n_tokens >= maxtok - 1)
return(-1);
if (nulls)
*token++ = str;
else
*token++ = str - 1;
}
if (c == '\0')
break;
}
*token++ = 0;
return(n_tokens);
}
#ifdef TEST
struct
{
char *sep;
int nulls;
} data[] =
{
{ "/.", 0 },
{ "/.", 1 },
{ "/", 0 },
{ "/", 1 },
{ ".", 0 },
{ ".", 1 },
{ "", 0 }
};
static char string[] = "/fred//bill.c/joe.b/";
int main(void)
{
int i;
int j;
int n;
char input[100];
char *token[20];
for (i = 0; i < DIM(data); i++)
{
strcpy(input, string);
printf("\n\nTokenising <<%s>> using <<%s>>, null %d\n",
input, data[i].sep, data[i].nulls);
n = tokenise(input, data[i].sep, token, DIM(token),
data[i].nulls);
printf("Return value = %d\n", n);
for (j = 0; j < n; j++)
printf("Token %d: <<%s>>\n", j, token[j]);
if (n > 0)
printf("Token %d: 0x%08lX\n", n, (unsigned long)token[n]);
}
return(0);
}
#endif /* TEST */
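For comparison, the caller-supplied-array interface proposed above can also be sketched on top of plain strtok. This is a minimal illustration, not the tokenise routine shown above: the name tokenize_fixed is hypothetical, empty tokens are skipped (strtok semantics), and overflow is reported with -1.

```c
#include <stddef.h>
#include <string.h>

/* Splits 'source' in place and stores pointers to the tokens in 'tokens'.
** The array is terminated by a NULL pointer, so at most max_tokens - 1
** tokens fit. Returns the number of tokens, or -1 if they do not fit. */
int tokenize_fixed(char *source, const char *sep, char **tokens, size_t max_tokens)
{
    size_t n = 0;

    if (max_tokens == 0)
        return -1;
    for (char *tok = strtok(source, sep); tok; tok = strtok(NULL, sep)) {
        if (n + 1 >= max_tokens)   /* keep one slot for the sentinel */
            return -1;
        tokens[n++] = tok;
    }
    tokens[n] = NULL;
    return (int)n;
}
```

With this interface there is nothing to free: the caller owns both the string and the pointer array, which is the trade-off described above.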

I don't see anything wrong with the strtok approach of modifying a string in place - it's the caller's choice whether to operate on a duplicated string or not, as the semantics are well understood. Below is the same method slightly simplified to use strtok as intended, yet still return a handy array of char * pointers (which now simply point to the tokenized segments of the original string). It gives the same output for your original main() call.
The main advantage of this approach is that you only have to free the returned character array, instead of looping through to clear all of the elements - an aspect which I thought took away a lot of the simplicity factor and something a caller would be very unlikely to expect to do by any normal C convention.
I also took out the goto statements, because with the code refactored they just didn't make much sense to me. I think the danger of having a single cleanup point is that it can start to grow too unwieldy and do extra steps that are not needed to clean up issues at specific locations.
Personally I think the main philosophical point I would make is that you should respect what other people using the language are going to expect, especially when creating library kinds of calls. Even if the strtok replacement behavior seems odd to you, the vast majority of C programmers are used to placing \0 in the middle of C strings to split them up or create shorter strings and so this will seem quite natural. Also as noted no-one is going to expect to do anything beyond a single free() with the return value from a function. You need to write your code in whatever way needed to make sure then that the code works that way, as people will simply not read any documentation you might offer and will instead act according to the memory convention of your return value (which is char ** so a caller would expect to have to free that).
char** tokenize(char* input, const char* sep)
{
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 4;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
/* This is the array filled with tokens and returned
*/
char** toks = malloc(size * sizeof(*toks));
if ( toks == NULL )
return NULL;
toks[ntok] = strtok( input, sep );
/* While we have more tokens to process...
*/
do
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (newtoks == NULL)
{
free(toks);
return NULL;
}
toks = newtoks;
}
ntok++;
toks[ntok] = strtok(0, sep);
} while (toks[ntok]);
return toks;
}

Just a few things:
Using gotos is not intrinsically evil or bad. Much like the preprocessor, they are often abused. In cases like yours, where you have to exit a function differently depending on how things went, they are appropriate.
Provide a functional means of freeing the returned array. I.e. tok_free(pointer).
Use the re-entrant version of strtok() initially, i.e. strtok_r(). It would not be cumbersome for someone to pass an additional argument (even NULL if not needed) for that.
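To illustrate why the re-entrant variant matters, here is a small sketch that tokenizes on two levels at once, something plain strtok cannot do because it keeps a single hidden state. It assumes a POSIX system (strtok_r is not ISO C); the function name count_nested_tokens is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L /* strtok_r is POSIX, not ISO C */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits 'text' into records on 'outer', splits each record into fields
** on 'inner', and returns the total number of fields. 'text' is modified. */
size_t count_nested_tokens(char *text, const char *outer, const char *inner)
{
    char *outer_state = NULL;   /* saved position for the outer scan */
    char *inner_state = NULL;   /* saved position for the inner scan */
    size_t total = 0;

    assert(text != NULL && outer != NULL && inner != NULL);
    for (char *rec = strtok_r(text, outer, &outer_state); rec != NULL;
         rec = strtok_r(NULL, outer, &outer_state)) {
        for (char *field = strtok_r(rec, inner, &inner_state); field != NULL;
             field = strtok_r(NULL, inner, &inner_state)) {
            total++;
        }
    }
    return total;
}
```

Each scan carries its own state pointer, so the inner loop does not disturb the outer one. A tokenize built this way would also be safe to call while the caller is in the middle of its own strtok loop.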

There is a great tool for detecting memory leaks called Valgrind.
http://valgrind.org/

If you want to find memory leaks, one possibility is to run it with valgrind.

Related

Segfault after strsep only when compiling with clang 10

I am writing a parser (for NMEA sentences) which splits a string on commas using strsep. When compiled with clang (Apple LLVM version 10.0.1), the code segfaults when splitting a string which has an even number of tokens. When compiled with clang (version 7.0.1) or gcc (9.1.1) on Linux the code works correctly.
A stripped down version of the code which exhibits the issue is as follows:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
static void gnss_parse_gsa (uint8_t argc, char **argv)
{
}
/**
* Descriptor for a NMEA sentence parser
*/
struct gps_parser_t {
void (*parse)(uint8_t, char**);
const char *type;
};
/**
* List of available NMEA sentence parsers
*/
static const struct gps_parser_t nmea_parsers[] = {
{.parse = gnss_parse_gsa, .type = "GPGSA"}
};
static void gnss_line_callback (char *line)
{
/* Count the number of comma separated tokens in the line */
uint8_t num_args = 1;
for (uint16_t i = 0; i < strlen(line); i++) {
num_args += (line[i] == ',');
}
/* Tokenize the sentence */
char *args[num_args];
for (uint16_t i = 0; (args[i] = strsep(&line, ",")) != NULL; i++);
/* Run parser for received sentence */
uint8_t num_parsers = sizeof(nmea_parsers)/sizeof(nmea_parsers[0]);
for (int i = 0; i < num_parsers; i++) {
if (!strcasecmp(args[0] + 1, nmea_parsers[i].type)) {
nmea_parsers[i].parse(num_args, args);
break;
}
}
}
int main (int argc, char **argv)
{
char pgsa_str[] = "$GPGSA,A,3,02,12,17,03,19,23,06,,,,,,1.41,1.13,0.85*03";
gnss_line_callback(pgsa_str);
}
The segfault occurs on the line if (!strcasecmp(args[0] + 1, nmea_parsers[i].type)) {, where the index operation on args attempts to dereference a null pointer.
Increasing the size of the stack, either by manually editing the assembly or adding a call to printf("") anywhere in the function makes it no longer segfault, as does making the args array bigger (eg. adding one to num_args).
In summary, any of the following items prevent the segfault:
- Using a compiler other than clang 10
- Modifying the assembly to make the stack size before dynamic allocation 80 bytes or more (compiles to 64)
- Using an input string with an odd number of tokens
- Allocating args as a fixed length array with the correct number of tokens (or more)
- Allocating args as a variable length array with at least num_args + 1 elements
Note that when compiled with clang 7 on Linux the stack size before dynamic allocation is still 64 bytes, but the code does not segfault.
I'm hoping that someone might be able to explain why this happens, and if there is any way I can get this code to compile correctly with clang 10.
When all sorts of barely-relevant factors like the specific version of the compiler seem to make a difference, it's a pretty sure sign you've got undefined behavior somewhere.
You correctly count the commas to predetermine the exact number of fields, num_args. You allocate an array just barely big enough to hold those fields:
char *args[num_args];
But then you run this loop:
for (uint16_t i = 0; (args[i] = strsep(&line, ",")) != NULL; i++);
There are going to be num_args trips through this loop where strsep returns non-NULL pointers that get filled in to args[0] through args[num_args-1], which is what you intended, and which is fine. But then there's one more call to strsep, the one that returns NULL and terminates the loop -- but that null pointer also gets stored into the args array, specifically into args[num_args], which is one cell off the end. Array overflow, in other words.
There are two ways to fix this. You can use an additional variable so you can capture and test strsep's return value before storing it into the args array:
char *p;
for (uint16_t i = 0; (p = strsep(&line, ",")) != NULL; i++)
args[i] = p;
This also has the side benefit that you have a more conventional loop, with an actual body.
Or, you can declare the args array one bigger than it strictly has to be, meaning that it's got room for that last, NULL pointer stored in args[num_args]:
char *args[num_args+1];
This has the side benefit that you always pass a "NULL terminated array" to the parsing functions, which can be handy for them (and which ends up matching, as it happens, the way main gets called).
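Both fixes boil down to never writing past the end of args. As a sketch of that idea, the helper below takes the array capacity explicitly and always leaves room for the terminating NULL. It is a portable stand-in that mimics strsep for a single delimiter character (empty fields are kept) rather than calling strsep itself; the name split_fields is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Splits 'line' in place on 'delim' (assumed not to be '\0'), keeping
** empty fields the way strsep does. Stores at most cap - 1 field pointers
** in 'args', followed by a terminating NULL. Returns the number stored. */
size_t split_fields(char *line, char delim, char **args, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return 0;
    while (line != NULL && n + 1 < cap) {
        char *end = strchr(line, delim);
        args[n++] = line;
        if (end != NULL) {
            *end = '\0';       /* terminate this field */
            line = end + 1;    /* next field starts after the delimiter */
        } else {
            line = NULL;       /* that was the last field */
        }
    }
    args[n] = NULL;            /* always within bounds: n <= cap - 1 */
    return n;
}
```

For the sentence in the question, which has 18 fields, an array of 19 pointers holds every field plus the NULL terminator.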

How to put a char into a empty pointer of a string in pure C

I want to store a single char into a char array pointer, and that action is in a while loop, adding a new char every time. I strictly want it to be stored in a variable and not printed, because I am going to compare the text. Here's my code:
#include <stdio.h>
#include <string.h>
int main()
{
char c;
char *string;
while((c=getchar())!= EOF) //gets the next char in stdin and checks if stdin is not EOF.
{
char temp[2]; // I was trying to convert c, a char, to temp, a const char, so that I can use strcat to concatenate them to string, but printf returns nothing.
temp[0]=c; //assigns temp
temp[1]='\0'; //null end point
strcat(string,temp); //concatenates the strings
}
printf(string); //prints out the string.
return 0;
}
I am using GCC on Debian (POSIX/UNIX operating system) and want to have Windows compatibility.
EDIT:
I notice some communication errors about what I actually intend to do, so I will explain: I want to create a system where I can input an unlimited amount of characters, have that input be stored in a variable, and have it read back to me from that variable. To get around using realloc and malloc, I made it get the next available char until EOF. Keep in mind that I am a beginner to C (though most of you have probably guessed that already) and haven't had a lot of experience with memory management.
If you want unlimited amount of character input, you'll need to actively manage the size of your buffer. Which is not as hard as it sounds.
First, use malloc to allocate, say, 1000 bytes.
Read until this runs out.
Use realloc to allocate 2000.
Read until this runs out.
Like this:
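#include <stdio.h>
#include <stdlib.h>
int main(void){
int buf_size=1000;
char* buf=malloc(buf_size);
int c; //int rather than char, so EOF can be told apart from real characters
int n=0;
if(buf==NULL)
return 1;
while((c=getchar())!= EOF)
{
buf[n++] = c;
if(n>=buf_size-1)
{
char* tmp;
buf_size+=1000;
tmp=realloc(buf, buf_size);
if(tmp==NULL)
{
free(buf);
return 1;
}
buf=tmp;
}
}
buf[n] = '\0'; //add trailing 0 at the end, to make it a proper string
//do stuff with buf;
free(buf);
return 0;
}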
You won't get around using malloc-oids if you want unlimited input.
You have undefined behavior.
You never set string to point anywhere, so you can't dereference that pointer.
You need something like:
char buf[1024] = "", *string = buf;
that initializes string to point to valid memory where you can write, and also sets that memory to an empty string so you can use strcat().
Note that looping strcat() like this is very inefficient, since it needs to find the end of the destination string on each call. It's better to just use pointers.
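As a rough sketch of the "just use pointers" suggestion, the helper below appends through a write pointer, so each character costs constant time instead of a rescan of the destination. The name read_chars and the FILE * parameter are assumptions made so the example can be tried on any stream; the buffer is still fixed-size, so input beyond the capacity is left unread.

```c
#include <stdio.h>

/* Reads characters from 'in' into 'buf' (capacity 'cap', which must be
** at least 1) until EOF or until the buffer is full. The result is
** always NUL-terminated. Returns the number of characters stored. */
size_t read_chars(FILE *in, char *buf, size_t cap)
{
    char *out = buf;              /* next position to write */
    char *last = buf + cap - 1;   /* reserved for the terminator */
    int c;                        /* int, so EOF is distinguishable */

    while (out < last && (c = getc(in)) != EOF)
        *out++ = (char)c;
    *out = '\0';
    return (size_t)(out - buf);
}
```

Calling it as read_chars(stdin, buf, sizeof buf) with the 1024-byte buf from above would replace the whole strcat loop.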
char *string;
You've declared an uninitialised variable with this statement. With some compilers, in debug this may be initialised to 0. In other compilers and a release build, you have no idea what this is pointing to in memory. You may find that when you build and run in release, your program will crash, but appears to be ok in debug. The actual behaviour is undefined.
You need to either create a variable on the stack by doing something like this
char string[100]; // assuming you're not going to receive more than 99 characters (100 including the NULL terminator)
Or, on the heap: -
char *string = malloc(100);
In which case you'll need to free the character array when you're finished with it.
Assuming you don't know how many characters the user will type, I suggest you keep track in your loop, to ensure you don't try to concatenate beyond the memory you've allocated.
Alternatively, you could limit the number of characters that a user may enter.
const int MAX_CHARS = 100;
char string[MAX_CHARS + 1]; // +1 for Null terminator
int numChars = 0;
while((numChars < MAX_CHARS) && ((c=getchar())!= EOF))
{
...
++numChars;
}
As I wrote in comments, you cannot avoid malloc() / calloc() and probably realloc() for a problem such as you have described, where your program does not know until run time how much memory it will need, and must not have any predetermined limit. In addition to the memory management issues on which most of the discussion and answers have focused, however, your code has some additional issues, including:
getchar() returns type int, and to correctly handle all possible inputs you must not convert that int to char before testing against EOF. In fact, for maximum portability you need to take considerable care in converting to char, for if default char is signed, or if its representation has certain other allowed (but rare) properties, then the value returned by getchar() may exceed its maximum value, in which case direct conversion gives an implementation-defined result (or raises an implementation-defined signal). (In truth, though, this issue is often ignored, usually to no ill effect in practice.)
Never pass a user-provided string to printf() as the format string. It will not do what you want for some inputs, and it can be exploited as a security vulnerability. If you want to just print a string verbatim then fputs(string, stdout) is a better choice, but you can also safely do printf("%s", string).
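A tiny sketch of the format-string point: in both helpers below the user text is only ever passed as data, never as the format. The helper names are hypothetical; fputs and snprintf are the standard functions.

```c
#include <stdio.h>

/* Writes 'text' to 'out' exactly as given; '%' has no special meaning. */
int print_verbatim(FILE *out, const char *text)
{
    return fputs(text, out);
}

/* Copies 'text' into 'dst' through a fixed "%s" format, so conversion
** specifiers inside 'text' are never interpreted. Returns what snprintf
** returns: the length 'text' would need, excluding the terminator. */
int format_verbatim(char *dst, size_t cap, const char *text)
{
    return snprintf(dst, cap, "%s", text);
}
```

Text such as "100%s%n" comes out unchanged from either helper, whereas passing it directly as the format to printf would make printf look for arguments that were never supplied.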
Here's a way to approach your problem that addresses all of these issues:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#define INITIAL_BUFFER_SIZE 1024
int main()
{
char *string = malloc(INITIAL_BUFFER_SIZE);
size_t cap = INITIAL_BUFFER_SIZE;
size_t next = 0;
int c;
if (!string) {
// allocation error
return 1;
}
while ((c = getchar()) != EOF) {
if (next + 1 >= cap) {
/* insufficient space for another character plus a terminator */
cap *= 2;
string = realloc(string, cap);
if (!string) {
/* memory reallocation failure */
/* memory was leaked, but it's ok because we're about to exit */
return 1;
}
}
#if (CHAR_MAX != UCHAR_MAX)
/* char is signed; ensure defined behavior for the upcoming conversion */
if (c > CHAR_MAX) {
c -= UCHAR_MAX + 1; /* e.g. 255 becomes -1 on two's complement */
#if ((CHAR_MAX != (UCHAR_MAX >> 1)) || (CHAR_MAX == (-1 * CHAR_MIN)))
/* char's representation has more padding bits than unsigned
char's, or it is represented as sign/magnitude or ones' complement */
if (c < CHAR_MIN) {
/* not representable as a char */
return 1;
}
#endif
}
#endif
string[next++] = (char) c;
}
string[next] = '\0';
fputs(string, stdout);
return 0;
}
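If the program has to keep running after an allocation failure, the same doubling strategy can be wrapped in a function that assigns the result of realloc to a temporary first, so the old block can still be freed. This is a sketch only; read_all is a hypothetical name, and the plain (char) cast glosses over the signed-char subtleties handled above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads everything from 'in' until EOF into a freshly allocated,
** NUL-terminated string. Returns NULL on allocation failure, in which
** case nothing is leaked. The caller frees the result. */
char *read_all(FILE *in)
{
    size_t cap = 64;
    size_t len = 0;
    char *buf = malloc(cap);
    int c;

    if (!buf)
        return NULL;
    while ((c = getc(in)) != EOF) {
        if (len + 1 >= cap) {   /* need room for c and the terminator */
            char *grown = realloc(buf, cap * 2);
            if (!grown) {
                free(buf);      /* the old block is still valid here */
                return NULL;
            }
            buf = grown;
            cap *= 2;
        }
        buf[len++] = (char)c;
    }
    buf[len] = '\0';
    return buf;
}
```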

Recursion with C

I have the code below to reverse a string recursively. It works when I print the chars after the recursion is finished, but I cannot figure out how to assemble the reversed chars into a string and return it to the caller. Does anyone have an idea? I don't want to add another parameter to accumulate chars. This is not homework; I am brushing up on small things since I will be graduating in a year and need to do well in interviews.
char* reversestring5(char* s)
{
int i = 0;
//Not at null terminator
if(*s!=0)
{
//advance the pointer
reversestring5(s+1);
printf("%c\n",*s);
}
}
With a recursive function, it's usually easiest to first figure out how to solve a trivial case (e.g. reversing a string with just a pair of characters) and then see how one might divide up the problem into simple operations culminating with the trivial case. For example, one might do this:
This is the actual recursive function:
char *revrecurse(char *front, char *back)
{
if (front < back) {
char temp = *front;
*front = *back;
*back = temp;
revrecurse(front+1, back-1);
}
return front;
}
This part just uses the recursive function:
char *reverse(char *str)
{
return revrecurse(str, &str[strlen(str)-1]);
}
Note that this assumes that the pointer is valid and that it points to a NUL-terminated string.
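One more assumption is hidden in &str[strlen(str)-1]: for an empty string the index wraps around, and the resulting pointer is not valid to form. A hedged sketch of a wrapper that checks for this (and for NULL) before recursing could look like the following; the names swap_inward and reverse_checked are made up for the example.

```c
#include <string.h>

/* Recursively swaps the characters at 'front' and 'back' and moves inward. */
static void swap_inward(char *front, char *back)
{
    if (front < back) {
        char tmp = *front;
        *front = *back;
        *back = tmp;
        swap_inward(front + 1, back - 1);
    }
}

/* Reverses 'str' in place. NULL, empty and one-character strings are
** returned unchanged, so no pointer before the start is ever formed. */
char *reverse_checked(char *str)
{
    size_t len;

    if (str == NULL)
        return NULL;
    len = strlen(str);
    if (len > 1)
        swap_inward(str, str + len - 1);
    return str;
}
```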
If you're going to actually reverse the characters, you can either provide a pair of pointers and recursively swap letters (which is what this routine does) or copy the characters one at a time into yet another space. That's essentially what your original code is doing; copying each character at a time to stdout which is a global structure that is not explicitly passed but is being used by your routine. The analog to that approach, but using pointers might look like this:
#define MAXSTRINGLEN 200
char revbuf[MAXSTRINGLEN];
char *refptr = revbuf;
char *revstring(char *s)
{
if (*s != 0)
{
revstring(s+1);
*refptr++ = *s; /* copy non-NUL characters */
} else {
*refptr++ = '\0'; /* copy NUL character */
}
return revbuf;
}
In this minor modification to your original code, you can now see the reliance of this approach on global variables revbuf and refptr which were hidden inside stdout in your original code. Obviously this is not even close to optimized -- it's intended solely for explanatory purposes.
"Reversing a string recursively" is a very vague statement of a problem, which allows for many completely different solutions.
Note that a "reasonable" solution should avoid making excessive passes over the string. Any solution that begins with strlen is not really a reasonable one. It is recursive for the sake of being recursive and nothing more. If one resorts to making an additional pass over the string, one no longer really needs a recursive solution at all. In other words, any solution that begins with strlen is not really satisfactory.
So, let's look for a more sensible single-pass recursive solution. And you almost got it already.
Your code already taught you that the reverse sequence of characters is obtained on the backtracking phase of recursion. That's exactly where you placed your printf. So, the "straightforward" approach would be to take these reversed characters, and instead of printf-ing them just write them back into the original string starting from the beginning of the string. A naive attempt to do this might look as follows
void reversestring_impl(char* s, char **dst)
{
if (*s != '\0')
{
reversestring_impl(s + 1, dst);
*(*dst)++ = *s;
}
}
void reversestring5(char* s)
{
char *dst = s;
reversestring_impl(s, &dst);
}
Note that this implementation uses an additional parameter dst, which carries the destination location for writing the next output character. That destination location remains unchanged on the forward pass of the recursion, and gets incremented as we write output characters on the backtracking pass of the recursion.
However, the above code will not work properly, since we are working "in place", i.e. using the same string as input and output at the same time. The beginning of the string will get overwritten prematurely. This will destroy character information that will be needed on later backtracking steps. In order to work around this issue each nested level of recursion should save its current character locally before the recursive call and use the saved copy after the recursive call
void reversestring_impl(char* s, char **dst)
{
if (*s != '\0')
{
char c = *s;
reversestring_impl(s + 1, dst);
*(*dst)++ = c;
}
}
void reversestring5(char* s)
{
char *dst = s;
reversestring_impl(s, &dst);
}
int main()
{
char str[] = "123456789";
reversestring5(str);
printf("%s\n", str);
}
The above works as intended.
If you really can't use a helper function and you really can't modify the interface to the function and you really must use recursion, you could do this, horrible though it is:
char *str_reverse(char *str)
{
size_t len = strlen(str);
if (len > 1)
{
char c0 = str[0];
char c1 = str[len-1];
str[len-1] = '\0';
(void)str_reverse(str+1);
str[0] = c1;
str[len-1] = c0;
}
return str;
}
This captures the first and last characters in the string (you could survive without capturing the first), then shortens the string, calls the function on the shortened string, then reinstates the swapped first and last characters. The return value is really of no help; I only kept it to keep the interface unchanged. This is clearest when the recursive call ignores the return value.
Note that this is gruesome for performance because it evaluates strlen() (N/2) times, rather than just once. Given a gigabyte string to reverse, that matters.
I can't think of a good way to write the code without using strlen() or its equivalent. To reverse the string in situ, you have to know where the end is somehow. Since the interface you stipulate does not include the information on where the end is, you have to find the end in the function, somehow. I don't regard strchr(str, '\0') as significantly different from strlen(str), for instance.
If you change the interface to:
void mem_reverse_in_situ(char *start, char *end)
{
if (start < end)
{
char c0 = *start;
*start = *end;
*end = c0;
mem_reverse_in_situ(start+1, end-1);
}
}
Then the reversal code avoids all issues of string length (or memory length) — requiring the calling code to deal with it. The function simply swaps the ends and calls itself on the middle segment. You'd not write this as a recursive function, though; you'd use an iterative solution:
void mem_reverse_in_situ(char *start, char *end)
{
while (start < end)
{
char c0 = *start;
*start++ = *end;
*end-- = c0;
}
}
char* reversestring5(char* s){
size_t len = strlen(s);
char last[2] = {*s};
return (len > 1) ? strcat(memmove(s, reversestring5(s+1), len), last) : s;
}
This is a good question, and the answer involves a technique that apparently few people are familiar with, judging by the other answers. This does the job ... it recursively converts the string into a linked list (kept on the stack, so it's quite efficient) that represents the reversal of the string. It then converts the linked list back into a string (which it does iteratively, but the problem statement doesn't say it can't). There's a complaint in the comments that this is "overkill", but any recursive solution will be overkill ... recursion is simply not a good way to process an array in reverse. But note that there is a whole set of problems that this approach can be applied to where one generates values on the fly rather than having them already available in an array, and then they are to be processed in reverse. Since the OP is interested in developing or brushing up on skills, this answer provides extra value ... and because this technique of creating a linked list on the stack and then consuming the linked list in the termination condition (as it must be, before the memory of the linked list goes out of scope) is apparently not well known. An example is backtrack algorithms such as for the Eight Queens problem.
In response to complaints that this isn't "pure recursive" because of the iterative copy of the list to the string buffer, I've updated it to do it both ways:
#include <stdio.h>
#include <stdlib.h>
typedef struct Cnode Cnode;
struct Cnode
{
char c;
const Cnode* next;
};
static void list_to_string(char* s, const Cnode* np)
{
#ifdef ALL_RECURSIVE
if (np)
{
*s = np->c;
list_to_string(s+1, np->next);
}
else
*s = '\0';
#else
for (; np; np = np->next)
*s++ = np->c;
*s = '\0';
#endif
}
static char* reverse_string_recursive(const char* s, size_t len, const Cnode* np)
{
if (*s)
{
Cnode cn = { *s, np };
return reverse_string_recursive(s+1, len+1, &cn);
}
char* rs = malloc(len+1);
if (rs)
list_to_string(rs, np);
return rs;
}
char* reverse_string(const char* s)
{
return reverse_string_recursive(s, 0, NULL);
}
int main (int argc, char** argv)
{
if (argc > 1)
{
const char* rs = reverse_string(argv[1]);
printf("%s\n", rs? rs : "[malloc failed in reverse_string]");
}
return 0;
}
Here's a "there and back again" [Note 1] in-place reverse which:
doesn't use strlen() and doesn't need to know how long the string is in advance; and
has a maximum recursion depth of half of the string length.
It also never backs up an iterator, so if it were written in C++, it could use a forward iterator. However, that feature is less interesting because it keeps iterators on the stack and requires that you can consistently iterate forward from an iterator, so it can't use input iterators. Still, it does mean that it can be used to in-place reverse values in a singly-linked list, which is possibly slightly interesting.
static void swap(char* lo, char* hi) {
char tmp = *hi;
*hi = *lo;
*lo = tmp;
}
static char* step(char* tortoise, char* hare) {
if (!hare[0]) return tortoise;     /* even length: hare is at the terminator */
if (!hare[1]) return tortoise + 1; /* odd length: skip the middle character */
hare = step(tortoise + 1, hare + 2);
swap(tortoise, hare);
return hare + 1;
}
void reverse_in_place(char* str) { step(str, str); }
Note 1: The "there and back again" pattern comes from a paper by Olivier Danvy and Mayer Goldberg, which makes for fun reading. The paper still seems to be online at ftp://ftp.daimi.au.dk/pub/BRICS/pub/RS/05/3/BRICS-RS-05-3.pdf
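The remark about singly-linked lists can be sketched as follows. This is an illustration only: the Node type and the names list_step and reverse_values are assumptions, not something from the answer or the paper. It swaps the stored values and leaves the links untouched.

```c
#include <stddef.h>

/* Hypothetical singly-linked list node. */
typedef struct Node {
    int value;
    struct Node *next;
} Node;

/* tortoise advances one node per call, hare two. When hare runs off
   the end, tortoise is at the middle; the returned pointer then walks
   the second half forward while the recursion unwinds. */
static Node *list_step(Node *tortoise, Node *hare)
{
    if (!hare)
        return tortoise;        /* even length */
    if (!hare->next)
        return tortoise->next;  /* odd length: skip the middle node */
    Node *back = list_step(tortoise->next, hare->next->next);
    int tmp = tortoise->value;
    tortoise->value = back->value;
    back->value = tmp;
    return back->next;
}

void reverse_values(Node *head)
{
    list_step(head, head);
}
```

As with the string version, the recursion depth is about half the length of the list.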

Double free error in string operations

I get a double free error from the code below when a long string is passed.
I tried all sorts of things. If I remove the free(s) line it goes away.
Not sure why it is happening.
void format_str(char *str1,int l,int o) {
char *s = malloc(strlen(str1)+1);
char *s1=s, *b = str1;
int i=0;
while(*str1!='\0') {
i++;
*s1++=*str1++;
if(i>=l) {
if(*str1!=',') {
continue;
}
*s1++=*str1++;
*s1++='\n';
for(i=0;i<o;i++) {
*s1++=' ';
}
i = 0;
}
}
*s1 = '\0';
strcpy(b,s);
free(s);
}
You probably aren't allocating enough space in s for the amount of data you're copying. I don't know what your logic is really doing, but I see stuff like
*s1++=*str1++;
*s1++='\n';
where you're copying more than one character into s (via s1) for a single character from str1.
And for the love of all that is computable, use better variable names!
You are almost certainly corrupting the heap. For example:
int main()
{
char original[1000] = "some,,,string,,, to,,,,format,,,,,";
printf( "original starts out %zu characters long\n", strlen(original)); /* strlen returns size_t, hence %zu */
format_str( original, 6, 6);
printf( "original is now %zu characters long\n", strlen(original));
return 0;
}
would require that the buffer allocated by malloc() be much larger than strlen(str1)+1 in size. Specifically, it would have to be at least 63 bytes long (as the function is coded in the question, the allocation has a size of 35 bytes).
If you need more specific help, you should describe what you're trying to do (such as what are the parameters l and o for?).
I'll try to reformat your code and guess-rename the variables for sake of mental health.
void format_str(char *str, int minlen, int indent)
{
char *tmpstr = malloc( strlen(str) + 1 ); // here is the problem
char *wrkstr = tmpstr, *savestr = str;
int count = 0;
while ( *str != '\0' ) {
count++;
*wrkstr++ = *str++;
if ( count >= minlen ) {
if ( *str != ',' ) {
continue;
}
*wrkstr++ = *str++;
*wrkstr++ = '\n';
for ( count = 0; count < indent; count++ ) {
*wrkstr++ = ' ';
}
count = 0;
}
}
*wrkstr = '\0';
strcpy(savestr,tmpstr);
free(tmpstr);
}
As others have pointed out you are not allocating sufficient space for the temporary string.
There are two other problems in your code (one of them is a major problem).
You probably should validate your arguments by checking that str is not NULL and maybe also that minlen and indent are not negative. This is not crucial however as a NULL str will just segfault (same behavior of standard library string functions) and values below 1 for minlen and/or indent just behave as if they were 0.
The major problem is with how much space you have in str. You blindly grow the string during formatting and then copy it back to the same memory. This is a buffer overflow waiting to happen (with potentially severe security implications, especially if str happens to point to the stack).
To fix it:
You should allocate sufficient space.
You should either return the allocated string and stipulate that the caller is responsible for freeing it (like strdup does), or add a parameter that specifies the space available in str and then avoid any work if it's not enough to store the formatted string.
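A minimal sketch of the first option, assuming the formatting logic from the question is what is wanted. The name format_str_alloc and the sizing rule are my own: each line break consumes one comma from the input and adds a newline plus the indent, so the number of commas gives a safe upper bound on the output size.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated formatted copy of str, or NULL if the
   allocation fails. The caller frees the result (like strdup). */
char *format_str_alloc(const char *str, int minlen, int indent)
{
    size_t pad = indent > 0 ? (size_t)indent : 0;
    size_t commas = 0;
    for (const char *p = str; *p; p++)
        if (*p == ',')
            commas++;

    /* Upper bound: input + ('\n' + pad spaces) per comma + '\0'. */
    char *out = malloc(strlen(str) + commas * (1 + pad) + 1);
    if (!out)
        return NULL;

    char *w = out;
    int count = 0;
    while (*str != '\0') {
        count++;
        *w++ = *str++;
        if (count >= minlen && *str == ',') {
            *w++ = *str++;              /* keep the comma */
            *w++ = '\n';                /* then break the line */
            for (size_t i = 0; i < pad; i++)
                *w++ = ' ';
            count = 0;
        }
    }
    *w = '\0';
    return out;
}
```

The input is now const and never written to, so the overflow of the caller's buffer disappears along with the heap corruption.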
This use case is a good example of why it is useful to be able to do a dry run.
I'd propose you modify your code like so:
ssize_t format_str(const char * input, int p1, int p2, char * output);
1 the target buffer shall be provided by the function caller via the parameter output passed to the function
2 the function shall return the number of characters written into the target buffer (negative values might indicate any sort of error)
3 if the value passed as output is NULL, the function does not copy anything but just parses the data referenced by input, determines how many characters would be written into the target buffer, and returns this value.
To then use the conversion function one shall call it twice, like so:
char * input = "some,,test , data,,, ...";
int p1 = <some value>, p2 = <some other value>;
ssize_t ssizeOutput = format_str(input, p1, p2, NULL);
if (0 > ssizeOutput)
{
exit(EXIT_FAILURE);
}
else if (0 < ssizeOutput)
{
char * output = calloc(ssizeOutput, sizeof(*output));
if (!output)
{
exit(EXIT_FAILURE);
}
ssizeOutput = format_str(input, p1, p2, output);
if (0 > ssizeOutput)
{
exit(EXIT_FAILURE);
}
}
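One way the proposed interface could be implemented is sketched below, reusing the formatting logic from the question. The names format_str_dry, put and format_str_two_pass are hypothetical, and I assume that the returned count does not include the terminating '\0', so the caller allocates one extra byte.

```c
#include <stdlib.h>
#include <sys/types.h> /* ssize_t (POSIX, not ISO C) */

/* Stores c at position *n, or only counts it when out is NULL. */
static void put(char *out, size_t *n, char c)
{
    if (out)
        out[*n] = c;
    (*n)++;
}

/* Returns the number of characters produced, NOT counting the
   terminating '\0', or -1 on bad arguments. */
ssize_t format_str_dry(const char *input, int minlen, int indent, char *output)
{
    size_t n = 0;
    int count = 0;

    if (!input)
        return -1;
    while (*input != '\0') {
        count++;
        put(output, &n, *input++);
        if (count >= minlen && *input == ',') {
            put(output, &n, *input++);
            put(output, &n, '\n');
            for (int i = 0; i < indent; i++)
                put(output, &n, ' ');
            count = 0;
        }
    }
    if (output)
        output[n] = '\0';
    return (ssize_t)n;
}

/* Two-pass use: measure, allocate, then format for real. */
char *format_str_two_pass(const char *input, int minlen, int indent)
{
    ssize_t need = format_str_dry(input, minlen, indent, NULL);
    if (need < 0)
        return NULL;
    char *out = malloc((size_t)need + 1); /* +1 for '\0' */
    if (out)
        format_str_dry(input, minlen, indent, out);
    return out;
}
```

Because both passes run the same code path, the size computed in the dry run cannot drift away from what the real run writes.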
As others have pointed out, the heap memory is most likely getting corrupted because the code writes beyond the end of the allocated memory.
Verifying whether memory is getting corrupted is simple. At the beginning of the function, save the length of str1; let's name it 'len_before'. Before calling free(), get the string length again; let's name it 'len_after'.
if (len_after > len_before) then we have a fatal error.
A relatively simple fix would be to pass in the max length that str1 can grow up to,
malloc that much memory and stop before exceeding the max length, i.e. truncate it with a null but remain within the limit.
int len_before, len_after;
len_before = strlen(str1) + 1;
.
. /* Rest of the code. */
.
len_after = strlen(str1) + 1;
if (len_after > len_before) {
printf("fatal error: buffer overflow by %d bytes.\n", len_after - len_before);
exit(1);
}
free(s);

Simple C string manipulation

I'm trying to do some very basic string processing in C (e.g. given a filename, chop off the file extension, manipulate the filename and then add the extension back on). I'm rather rusty on C and am getting segmentation faults.
char* fname;
char* fname_base;
char* outdir;
char* new_fname;
.....
fname = argv[1];
outdir = argv[2];
fname_len = strlen(fname);
strncpy(fname_base, fname, (fname_len-4)); // weird characters at the end of the truncation?
strcpy(new_fname, outdir); // getting a segmentation on this I think
strcat(new_fname, "/");
strcat(new_fname, fname_base);
strcat(new_fname, "_test");
strcat(new_fname, ".jpg");
printf("string=%s",new_fname);
Any suggestions or pointers welcome.
Many thanks and apologies for such a basic question
You need to allocate memory for new_fname and fname_base. Here is how you would do it for new_fname:
new_fname = (char*)malloc((strlen(outdir)+1)*sizeof(char));
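In strlen(outdir)+1, the +1 part allocates room for the terminating null character '\0'. Note that this size covers only outdir; the buffer must also have room for everything you append to it afterwards.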
In addition to what others are indicating, I would be careful with
strncpy(fname_base, fname, (fname_len-4));
You are assuming you want to chop off the last 4 characters (.???). If there is no file extension or it is not 3 characters, this will not do what you want. The following should give you an idea of what might be needed (I assume that the last '.' indicates the file extension). Note that my 'C' is very rusty (warning!)
char *s;
s = (char *) strrchr (fname, '.');
if (s == 0)
{
strcpy (fname_base, fname);
}
else
{
strncpy (fname_base, fname, strlen(fname)-strlen(s));
fname_base[strlen(fname)-strlen(s)] = 0;
}
You have to malloc fname_base and new_fname, I believe.
ie:
fname_base = (char *)(malloc(sizeof(char)*(fname_len+1)));
fname_base[fname_len] = 0; //to stick in the null termination
and similarly for new_fname and outdir
You're using uninitialized pointers as targets for strcpy-like functions: fname_base and new_fname: you need to allocate memory areas to work on, or declare them as char array e.g.
char fname_base[FILENAME_MAX];
char new_fname[FILENAME_MAX];
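With fixed-size arrays like these, snprintf can build the whole name in one call and report truncation. The following is only a sketch: the function name build_name is made up, and it assumes fname has no directory part, so the last '.' really starts the extension.

```c
#include <stdio.h>
#include <string.h>

/* Writes "<outdir>/<base>_test.jpg" into dst. Returns 0 on success,
   -1 if the result does not fit (dst then holds a truncated string). */
int build_name(char *dst, size_t dstsize, const char *outdir, const char *fname)
{
    const char *dot = strrchr(fname, '.');
    /* Length of fname without its extension; the whole name if no dot. */
    int baselen = dot ? (int)(dot - fname) : (int)strlen(fname);
    /* %.*s prints at most baselen characters of fname. */
    int n = snprintf(dst, dstsize, "%s/%.*s_test.jpg", outdir, baselen, fname);
    return (n >= 0 && (size_t)n < dstsize) ? 0 : -1;
}
```

A call would look like build_name(new_fname, sizeof new_fname, outdir, fname), with the return value checked before new_fname is used.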
You could combine the malloc that has been suggested with the string manipulations in one statement:
if ( asprintf(&new_fname,"%s/%s_test.jpg",outdir,fname_base) >= 0 )
// success, else failed
then at some point, free(new_fname) to release the memory.
(note this is a GNU extension which is also available in *BSD)
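Where asprintf is not available, a similar effect can be had with two calls to C99 snprintf: the first, with a NULL buffer and size 0, only reports the length needed. This is a sketch under that assumption; make_name is a made-up name.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns a malloc'ed "<outdir>/<fname_base>_test.jpg" (caller frees),
   or NULL on failure. */
char *make_name(const char *outdir, const char *fname_base)
{
    /* First pass: measure only, nothing is written. */
    int n = snprintf(NULL, 0, "%s/%s_test.jpg", outdir, fname_base);
    if (n < 0)
        return NULL;
    char *s = malloc((size_t)n + 1); /* +1 for the terminating '\0' */
    if (s)
        snprintf(s, (size_t)n + 1, "%s/%s_test.jpg", outdir, fname_base);
    return s;
}
```

As with asprintf, the result has to be released with free() at some point.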
Cleaner code:
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
const char *extra = "_test.jpg";
int main(int argc, char** argv)
{
char *fname = strdup(argv[1]); /* duplicate, we need to truncate the dot */
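char *outdir = argv[2]; /* argv[1] is the file name, argv[2] the output directory */
char *dotpos;
/* ... */
size_t new_size = strlen(fname)+strlen(extra)+1; /* +1 for the terminating '\0' */
char *new_fname = malloc(new_size); /* allocate once; a second malloc would leak this one */
if(!new_fname)
return 1;
dotpos = strchr(fname, '.');
if(dotpos)
*dotpos = '\0'; /* truncate at the dot */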
snprintf(new_fname, new_size, "%s%s", fname, extra);
printf("%s\n", new_fname);
return 0;
}
In the following code I do not call malloc.
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
/* Change this to '\\' if you are doing this on MS-windows or something like it. */
#define DIR_SYM '/'
#define EXT_SYM '.'
#define NEW_EXT "jpg"
int main(int argc, char * argv[] ) {
char * fname;
char * outdir;
if (argc < 3) {
fprintf(stderr, "I want more command line arguments\n");
return 1;
}
fname = argv[1];
outdir = argv[2];
char * fname_base_begin = strrchr(fname, DIR_SYM); /* last occurrence of DIR_SYM */
if (!fname_base_begin) {
fname_base_begin = fname; // No directory symbol means that there's nothing
// to chop off of the front.
}
char * fname_base_end = strrchr(fname_base_begin, EXT_SYM);
/* NOTE: No need to search for EXT_SYM in part of the fname that we have cut off
* the front and then have to deal with finding the last EXT_SYM before the last
* DIR_SYM */
if (!fname_base_end) {
fprintf(stderr, "I don't know what you want to do when there is no extension\n");
return 1;
}
*fname_base_end = '\0'; /* Makes this an end of string instead of EXT_SYM */
/* NOTE: In this code I actually changed the string passed in with the previous
* line. This is often not what you want to do, but in this case it should be ok.
*/
// This line should get you the results I think you were trying for in your example
printf("string=%s%c%s_test%c%s\n", outdir, DIR_SYM, fname_base_begin, EXT_SYM, NEW_EXT);
// This line should just append _test before the extension, but leave the extension
// as it was before.
printf("string=%s%c%s_test%c%s\n", outdir, DIR_SYM, fname_base_begin, EXT_SYM, fname_base_end+1);
return 0;
}
I was able to get away with not allocating memory to build the string in because I let printf actually worry about building it, and took advantage of knowing that the original fname string would not be needed in the future.
I could have allocated the space for the string by calculating how long it would need to be based on the parts and then used sprintf to form the string for me.
Also, if you don't want to alter the contents of the fname string you could also have used:
printf("string=%s%c%.*s_test%c%s\n", outdir, DIR_SYM, (int)(fname_base_end - fname_base_begin), fname_base_begin, EXT_SYM, fname_base_end+1);
The precision given through .* makes printf use only the first part of the string, up to but not including the EXT_SYM.
The basis of any C string manipulation is that you must write into (and normally read from) memory you "own". Declaring a pointer (type *x) reserves space for the pointer, not for the pointee, whose size of course can't be known by magic, so you have to malloc (or similar) or provide a local buffer such as char buf[size].
And you should always be aware of buffer overflows.
As suggested, using sprintf (with a correctly allocated destination buffer) or similar could be a good idea. If you want to keep your current strcat approach, remember that to concatenate strings, strcat always has to "walk" through the current string from its beginning. So, if you don't need (oops!) buffer overflow checks of any kind, appending chars "by hand" is a bit faster: when you finish appending a string, you know where the new end is, and the next append can start from there.
But strcat doesn't tell you the address of the last char appended, and using strlen would nullify the effort. So a possible solution could be
size_t i, l = strlen(new_fname);
new_fname[l++] = '/';
for(i = 0; fname_base[i] != 0; i++, l++) new_fname[l] = fname_base[i];
for(i = 0; testjpgstring[i] != 0; i++, l++) new_fname[l] = testjpgstring[i];
new_fname[l] = 0; // terminate the string...
and you can continue using l... (testjpgstring = "_test.jpg")
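The same idea can be wrapped in a small helper that remembers the end of the string and also checks bounds. This is only a sketch; the name append and its convention (return the new end, or NULL when the text does not fit) are my own choices.

```c
#include <stddef.h>

/* Copies src to position `end` inside a buffer whose one-past-the-end
   address is `limit`; `end` must point inside the buffer. Returns the
   new end (pointing at the '\0'), or NULL if src does not fit, in
   which case the buffer holds a truncated, terminated string. */
char *append(char *end, const char *limit, const char *src)
{
    if (!end)
        return NULL;            /* propagate an earlier failure */
    while (*src != '\0') {
        if (limit - end < 2) {  /* need room for this char and '\0' */
            *end = '\0';
            return NULL;
        }
        *end++ = *src++;
    }
    *end = '\0';
    return end;
}
```

Calls can be chained, e.g. e = append(buf, buf + sizeof buf, outdir); e = append(e, buf + sizeof buf, "/"); and so on, with a single check of e against NULL at the end. No call has to rescan the string from its beginning.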
However, if your program is full of string manipulations, I suggest using a string library (out of laziness I often use glib).

Resources