Double free error in string operations - c

I get a double free error from the code below when a long string is passed in.
I tried all sorts of things. If I remove the free(s) line, the error goes away.
I'm not sure why it is happening.
void format_str(char *str1,int l,int o) {
char *s = malloc(strlen(str1)+1);
char *s1=s, *b = str1;
int i=0;
while(*str1!='\0') {
i++;
*s1++=*str1++;
if(i>=l) {
if(*str1!=',') {
continue;
}
*s1++=*str1++;
*s1++='\n';
for(i=0;i<o;i++) {
*s1++=' ';
}
i = 0;
}
}
*s1 = '\0';
strcpy(b,s);
free(s);
}

You probably aren't allocating enough space in s for the amount of data you're copying. I don't know what your logic is really doing, but I see stuff like
*s1++=*str1++;
*s1++='\n';
where you're copying more than one character into s (via s1) for a single character from str1.
And for the love of all that is computable, use better variable names!

You are almost certainly corrupting the heap. For example:
int main()
{
char original[1000] = "some,,,string,,, to,,,,format,,,,,";
printf( "original starts out %zu characters long\n", strlen(original));
format_str( original, 6, 6);
printf( "original is now %zu characters long\n", strlen(original));
return 0;
}
would require that the buffer allocated by malloc() be much larger than strlen(str1)+1 in size. Specifically, it would have to be at least 63 bytes long (as the function is coded in the question, the allocation has a size of 35 bytes).
If you need more specific help, you should describe what you're trying to do (such as what are the parameters l and o for?).

I'll try to reformat your code and guess-rename the variables for sake of mental health.
void format_str(char *str, int minlen, int indent)
{
char *tmpstr = malloc( strlen(str) + 1 ); // here is the problem
char *wrkstr = tmpstr, *savestr = str;
int count = 0;
while ( *str != '\0' ) {
count++;
*wrkstr++ = *str++;
if ( count >= minlen ) {
if ( *str != ',' ) {
continue;
}
*wrkstr++ = *str++;
*wrkstr++ = '\n';
for ( count = 0; count < indent; count++ ) {
*wrkstr++ = ' ';
}
count = 0;
}
}
*wrkstr = '\0';
strcpy(savestr,tmpstr);
free(tmpstr);
}
As others have pointed out you are not allocating sufficient space for the temporary string.
There are two other problems in your code (one of them is a major problem).
You probably should validate your arguments by checking that str is not NULL and maybe also that minlen and indent are not negative. This is not crucial however as a NULL str will just segfault (same behavior of standard library string functions) and values below 1 for minlen and/or indent just behave as if they were 0.
The major problem is with how much space you have in str. You blindly grow the string during formatting and then copy it back to the same memory. This is a buffer overflow waiting to happen (with potentially severe security implications, especially if str happens to point to the stack).
To fix it:
You should allocate sufficient space.
You should either return the allocated string and stipulate that the caller is responsible for freeing it (like strdup does) or add a parameter that specifies the space available in str and then avoid any work if it's not enough to store the formatted string.
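The first option can be sketched as follows. This is only an illustration: the name format_str_alloc and the worst-case size estimate (every comma may add one newline plus indent spaces) are assumptions, not part of the original code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of format_str: returns a newly allocated, formatted
 * copy of str, or NULL if allocation fails. The caller must free() it. */
char *format_str_alloc(const char *str, int minlen, int indent)
{
    assert(str != NULL);
    if (indent < 0)
        indent = 0;

    /* Worst case: every comma is followed by '\n' plus indent spaces. */
    size_t commas = 0;
    for (const char *p = str; *p != '\0'; p++)
        if (*p == ',')
            commas++;
    size_t cap = strlen(str) + commas * (1 + (size_t)indent) + 1;

    char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    char *w = out;
    int count = 0;
    while (*str != '\0') {
        count++;
        *w++ = *str++;
        if (count >= minlen && *str == ',') {
            *w++ = *str++;              /* keep the comma on the current line */
            *w++ = '\n';
            for (int k = 0; k < indent; k++)
                *w++ = ' ';
            count = 0;
        }
    }
    *w = '\0';
    return out;
}
```

Because the result lives in its own buffer, the input string is never grown in place, which removes the second overflow described above.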

This use case is a good example of why it helps to be able to do a dry run.
I'd propose you modify your code like so:
ssize_t format_str(const char * input, int p1, int p2, char * output);
1 the target buffer shall be provided by the function caller via the parameter output passed to the function
2 the function shall return the number of characters written into the target buffer (negative values might indicate any sort of error)
3 if the value passed as output is NULL the function does not copy anything but just parses the data referenced by input and determines how many characters would be written into the target buffer and returns this value.
To then use the conversion function one shall call it twice, like so:
char * input = "some,,test , data,,, ...";
int p1 = <some value>, p2 = <some other value>;
ssize_t ssizeOutput = format_str(input, p1, p2, NULL);
if (0 > ssizeOutput)
{
exit(EXIT_FAILURE);
}
else if (0 < ssizeOutput)
{
char * output = calloc(ssizeOutput, sizeof(*output));
if (!output)
{
exit(EXIT_FAILURE);
}
ssizeOutput = format_str(input, p1, p2, output);
if (0 > ssizeOutput)
{
exit(EXIT_FAILURE);
}
}
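The dry-run idea could be implemented roughly like this. It is a sketch under stated assumptions: the name format_str_n is hypothetical, it returns size_t rather than ssize_t, and the returned count includes the terminating '\0' so that it can be passed straight to malloc().

```c
#include <assert.h>
#include <stddef.h>

/* Store c at out[*n] unless this is a dry run (out == NULL); always count it. */
static void put(char *out, size_t *n, char c)
{
    if (out != NULL)
        out[*n] = c;
    (*n)++;
}

/* Hypothetical two-pass formatter. Returns the number of bytes needed
 * (or written), including the terminating '\0'. With output == NULL
 * nothing is stored and only the size is computed. */
size_t format_str_n(const char *input, int minlen, int indent, char *output)
{
    size_t n = 0;
    int count = 0;

    assert(input != NULL);
    while (*input != '\0') {
        count++;
        put(output, &n, *input++);
        if (count >= minlen && *input == ',') {
            put(output, &n, *input++);  /* keep the comma on this line */
            put(output, &n, '\n');
            for (int k = 0; k < indent; k++)
                put(output, &n, ' ');
            count = 0;
        }
    }
    if (output != NULL)
        output[n] = '\0';
    return n + 1;
}
```

For the example string used in an earlier answer ("some,,,string,,, to,,,,format,,,,," with both parameters set to 6), the first pass reports 63 bytes, which matches the size quoted there.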

As others have pointed out, the heap memory is most likely getting corrupted because the code writes beyond the end of the allocated memory.
To verify whether memory is getting corrupted or not is simple. At beginning of function save the length of str1, let's name it 'len_before'. Before calling free(), get the string length again and let's name it 'len_after'.
if (len_after > len_before) then we have a fatal error.
A relatively simple fix would be to pass in the max length that str1 can grow up to,
malloc that much memory and stop before exceeding the max length, i.e. truncate it with a null but remain within the limit.
int len_before, len_after;
len_before = strlen(str1) + 1;
.
. /* Rest of the code. */
.
len_after = strlen(s) + 1; /* measure the formatted copy in s; str1 has been advanced to its terminator by now */
if (len_after > len_before) {
printf("fatal error: buffer overflow by %d bytes.\n", len_after - len_before);
exit(1);
}
free(s);

Related

random chars in dynamic char array C

I need help with a char array. I want to create an array of length n and initialize its values, but after the malloc() call the array is longer than n*sizeof(char), and its content isn't only the chars I assign... There are a few random chars in the array and I don't know how to solve that... I need this part of the code for a school exam project, and I have to finish by Sunday... Please help :P
#include<stdlib.h>
#include<stdio.h>
int main(){
char *text;
int n = 10;
int i;
if((text = (char*) malloc((n)*sizeof(char))) == NULL){
fprintf(stderr, "allocation error");
}
for(i = 0; i < n; i++){
//text[i] = 'A';
strcat(text,"A");
}
int test = strlen(text);
printf("\n%d\n", test);
puts(text);
free(text);
return 0;
}
Well before using strcat make
text[0]=0;
strcat expects null terminated char array for the first argument also.
From standard 7.24.3.1
#include <string.h>
char *strcat(char * restrict s1,
const char * restrict s2);
The strcat function appends a copy of the string pointed to by s2
(including the terminating null character) to the end of the string
pointed to by s1. The initial character of s2 overwrites the null
character at the end of s1.
How do you think strcat will know where the first string ends if you don't
put a \0 in s1.
Also don't forget to allocate an extra byte for the \0 character. Otherwise you are writing past what you have allocated for. This is again undefined behavior.
And earlier you had undefined behavior.
Note:
You should check the return value of malloc to know whether the malloc invocation was successful or not.
Casting the return value of malloc is not needed. Conversion from void* to relevant pointer is done implicitly in this case.
strlen returns size_t not int. printf("%zu",strlen(text))
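Putting these points together, a corrected version might look like the sketch below. The helper name repeat_piece is hypothetical; the essential parts are the extra byte for the terminator and the text[0] = '\0' before the first strcat call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated string made of n copies
 * of piece, or NULL if allocation fails. The caller must free() it. */
char *repeat_piece(const char *piece, size_t n)
{
    assert(piece != NULL);
    size_t piece_len = strlen(piece);

    /* One extra byte for the terminating '\0'. */
    char *text = malloc(piece_len * n + 1);
    if (text == NULL)
        return NULL;

    text[0] = '\0';                 /* strcat needs a valid (empty) string */
    for (size_t i = 0; i < n; i++)
        strcat(text, piece);
    return text;
}
```

With repeat_piece("A", 10), strlen() of the result is 10 and no byte outside the allocation is touched.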
To start with, your way of using malloc in
text = (char*) malloc((n)*sizeof(char));
is not ideal. You can change that to
text = malloc(n * sizeof *text); // Don't cast; using *text is straightforward and easy.
So the statement could be
if(NULL == (text = malloc(n * sizeof *text))){
fprintf(stderr, "allocation error");
}
But the actual problem lies in
for(i = 0; i < n; i++){
//text[i] = 'A';
strcat(text,"A");
}
The strcat documentation says
dest − This is pointer to the destination array, which should contain
a C string, and should be large enough to contain the concatenated
resulting string.
Just to point out that the above method is flawed, you just need to consider that the C string "A" actually contains two characters, A and the terminating \0 (the null character). Even if text started out as an empty string, when i is n-1 the terminating \0 would be written to text[n], which is an out-of-bounds access (a buffer overrun). If you wanted to fill the entire text array with A, you could have done
for(i = 0; i < n; i++){
// Note: an array of length n can store n-1 chars plus the terminating null
text[i]=(i<n-1)?'A':'\0'; // the last index, n-1, holds the terminator
}
//Then print the null terminated string
printf("Filled string : %s\n",text); // You're all good :-)
Note: Use a tool like valgrind to find memory leaks & out of bound memory accesses.

C - fgets returns corrupt data

I am making a program that should read a file and do some stuff based on it, however I seem to have problem with reading the line properly.
I am using while loop with this function:
static int readLineFromFile(char **destination, int *allocSize, FILE *file)
{
char *newDest;
if (fgets(*destination, (*allocSize) - 2, file))
{
int strend = strlen(*destination);
while (*(*destination + strend - 1) != '\n')
{
printf("\"%s\"\n", *destination);
size_t length = (*allocSize) * 2; //WE ARE GOING TO ALLOCATE MORE MEMORY AND KEEP READING
newDest = realloc(*destination, length);
if (!newDest)
{
free(*destination);
return 2; //2 - FAILED ALLOC
}
*destination = newDest;
fgets(*destination + strend, (length / 2), file);
*allocSize = length;
strend = strlen(*destination);
}
*(*destination+strend-1) = '\0'; //WE DONT NEED THE \n AT THE END, SO WE JUST REPLACE IT WITH \0
printf("D %s\n", *destination);
return 0;
}
else{
if (feof(file)) { //IF NOTHING IS LEFT TO READ, WE CHECK IF IT IS BECAUSE OF EOF (1) OR AN ERROR (2)
return 1;
}
return 2;
}
}
where **destination is pointer to the pointer of mallocated area for the string, *allocSize pointer for the size of **destination and *file the file stream.
However, after 2nd reallocation (so, after I have had 2 lines longer than allocSize and had to realloc, doesn't matter after how many lines exactly), the printf-s (both after the realloc while cycle and at the beginning of the realloc cycle) return 5 chars correctly and 4 random ones after that ("ABCDE????" instead of "ABCDEFGHIJK..."). I thought that it was because of strtok function later in the while cycle (after reading the line), but seems it's not.
Then I found out that later in the code, I allocate space for parsed substrings of the line I loaded - and if I remove this part of the code, everything works fine again (it doesnt even matter, how many bytes I allocate, I still keep getting corrupted strings). So at this point I even started thinking if I haven't broken the OS somehow.
So, it would be nice if someone could tell me, if there is a mistake in this function, or if I should look for it elsewhere. Thanks.
Also, here are the other codes from the while cycle:
STRTOK, fileLine is the *destination from readLine:
char *strtokStart = malloc (strlen (fileLine)+2);
strcpy (strtokStart, fileLine);
char *strtokSave = fileLine;
strcpy (newKey, strtok_r (fileLine, "=", &strtokSave));
strcpy (newValue, strtok_r (NULL, "\n", &strtokSave));
free (strtokStart);
MALLOCATING additional strings:
config->topKey->key = (const char *) malloc(strlen(newKey) + 1);
config->topKey->value = (const char *) malloc(strlen(newValue) + 1);
if (!(config->topKey->key && config->topKey->value))
{
printf("ALLOC FAIL\n");
statusVal = 1;
break;
}
There are two things wrong with readLineFromFile:
If the file does not end in \n, the while loop keeps allocating more memory until realloc fails. Check the return value of the second fgets to detect that situation!
Bad error handling for a failed realloc: free(*destination) leaves the caller with a dangling pointer. It is best to just remove the free, or to add *destination=NULL after it.
Otherwise readLineFromFile is fine and should work with valid input files. If your input files are valid, the error is probably elsewhere in your code.
readLineFromFile should be called with a malloc'ed buffer, of course.
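Both fixes can be illustrated with a reworked reader. This is a sketch, not the asker's function: the name read_line is hypothetical, it assumes *buf was obtained from malloc with *cap >= 2 (and that *cap fits in an int), and it keeps the same return codes (0 = line read, 1 = end of file, 2 = error).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Reads one line into *buf, growing it as needed. The trailing '\n',
 * if any, is removed. On a failed realloc the old buffer stays valid. */
int read_line(char **buf, size_t *cap, FILE *fp)
{
    size_t len = 0;

    assert(buf != NULL && *buf != NULL && cap != NULL && *cap >= 2 && fp != NULL);
    for (;;) {
        if (fgets(*buf + len, (int)(*cap - len), fp) == NULL) {
            if (ferror(fp))
                return 2;               /* read error */
            return len == 0 ? 1 : 0;    /* EOF: nothing read, or partial line */
        }
        len += strlen(*buf + len);
        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[len - 1] = '\0';     /* drop the newline */
            return 0;
        }
        if (len + 1 < *cap)
            return 0;                   /* last line of a file without '\n' */

        /* Buffer is full and the line is not finished: double it. */
        char *grown = realloc(*buf, *cap * 2);
        if (grown == NULL)
            return 2;                   /* *buf is untouched and still owned by the caller */
        *buf = grown;
        *cap *= 2;
    }
}
```

Checking every fgets result means a final line without '\n' ends the loop instead of growing the buffer forever.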

Storing a string in char* in C

In the code below, I hope you can see that I have a char* variable and that I want to read in a string from a file. I then want to pass this string back from the function. I'm rather confused by pointers so I'm not too sure what I'm supposed to do really.
The purpose of this is to then pass the array to another function to be searched for a name.
Unfortunately the program crashes as a result and I've no idea why.
char* ObtainName(FILE *fp)
{
char* temp;
int i = 0;
temp = fgetc(fp);
while(temp != '\n')
{
temp = fgetc(fp);
i++;
}
printf("%s", temp);
return temp;
}
Any help would be vastly appreciated.
fgetc returns an int, not a char*. This int is a character from the stream, or EOF if you reach the end of the file.
You're implicitly converting the int to a char*, i.e., interpreting it as an address (turn your warnings on). When you call printf, it reads from that address and continues to read a character at a time looking for the null terminator which ends the string, but that address is almost certainly invalid. This is undefined behavior.
I've taken some liberties with what you wanted to accomplish. Rather than deal with pointers, you can just use a fixed-size array as long as you can set a maximum length. I've also included several checks so that you don't run off the end of the buffer or the end of the file. It is also important to make sure that you have a null terminator '\0' at the end of the string.
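#include <stdio.h>
#define MAX_LEN 100
char* ObtainName(FILE *fp)
{
static char temp[MAX_LEN];
int i = 0;
int c; /* int, not char, so that EOF can be told apart from real characters */
while(i < MAX_LEN-1)
{
c = fgetc(fp);
if (c == EOF || c == '\n') /* test the value just read instead of calling feof() first */
{
break;
}
temp[i] = (char)c;
i++;
}
temp[i] = '\0';
printf("%s", temp);
return temp;
}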
So, there are several problems here:
You're not setting aside any storage for the string contents;
You're not storing the string contents correctly;
You're attempting to read memory that doesn't belong to you;
The way you're attempting to return the string is going to give you heartburn.
1. You're not setting aside storage for the string contents
The line
char *temp;
declares temp as a pointer to char; its value will be the address of a single character value. Since it's declared at local scope without the static keyword, its initial value will be indeterminate, and that value may not correspond to a valid memory address.
It does not set aside any storage for the string contents read from fp; that would have to be done as a separate step, which I'll get to below.
2. You're not storing the string contents correctly
The line
temp = fgetc(fp);
reads the next character from fp and assigns it to temp. First of all, this means you're only storing the last character read from the stream, not the whole string. Secondly, and more importantly, you're assigning the result of fgetc() (which returns a value of type int) to an object of type char * (which is treated as an address). You're basically saying "I want to treat the letter 'a' as an address into memory." This brings us to...
3. You're attempting to read memory that doesn't belong to you
In the line
printf("%s", temp);
you're attempting to print out the string beginning at the address stored in temp. Since the last thing you wrote to temp was most likely a character whose value is < 127, you're telling printf to start at a very low and most likely not accessible address, hence the crash.
4. The way you're attempting to return the string is guaranteed to give you heartburn
Since you've defined the function to return a char *, you're going to need to do one of the following:
Allocate memory dynamically to store the string contents, and then pass the responsibility of freeing that memory on to the function calling this one;
Declare an array with the static keyword so that the array doesn't "go away" after the function exits; however, this approach has serious drawbacks;
Change the function definition;
Allocate memory dynamically
You could use dynamic memory allocation routines to set aside a region of storage for the string contents, like so:
char *temp = malloc( MAX_STRING_LENGTH * sizeof *temp );
or
char *temp = calloc( MAX_STRING_LENGTH, sizeof *temp );
and then return temp as you've written.
Both malloc and calloc set aside the number of bytes you specify; calloc will initialize all those bytes to 0, which takes a little more time, but can save your bacon, especially when dealing with text.
The problem is that somebody has to deallocate this memory when it's no longer needed; since you return the pointer, whoever calls this function now has the responsibility to call free() when it's done with that string, something like:
void Caller( FILE *fp )
{
...
char *name = ObtainName( fp );
...
free( name );
...
}
This spreads the responsibility for memory management around the program, increasing the chances that somebody will forget to release that memory, leading to memory leaks. Ideally, you'd like to have the same function that allocates the memory free it.
Use a static array
You could declare temp as an array of char and use the static keyword:
static char temp[MAX_STRING_SIZE];
This will set aside MAX_STRING_SIZE characters in the array when the program starts up, and it will be preserved between calls to ObtainName. No need to call free when you're done.
The problem with this approach is that by creating a static buffer, the code is not re-entrant; if ObtainName called another function which in turn called ObtainName again, that new call will clobber whatever was in the buffer before.
Why not just declare temp as
char temp[MAX_STRING_SIZE];
without the static keyword? The problem is that when ObtainName exits, the temp array ceases to exist (or rather, the memory it was using is available for someone else to use). That pointer you return is no longer valid, and the contents of the array may be overwritten before you can access it again.
Change the function definition
Ideally, you'd like for ObtainName to not have to worry about the memory it has to write to. The best way to achieve that is for the caller to pass the target buffer as a parameter, along with the buffer's size:
int ObtainName( FILE *fp, char *buffer, size_t bufferSize )
{
...
}
This way, ObtainName writes data into the location that the caller specifies (useful if you want to obtain multiple names for different purposes). The function will return an integer value, which can be a simple success or failure, or an error code indicating why the function failed, etc.
Note that if you're reading text, you don't have to read character by character; you can use functions like fgets() or fscanf() to read an entire string at a time.
Use fscanf if you want to read whitespace-delimited strings (i.e., if the input file contains "This is a test", fscanf( fp, "%s", temp); will only read "This"). If you want to read an entire line (delimited by a newline character), use fgets().
Assuming you want to read an individual string at a time, you'd use something like the following (assumes C99):
#define FMT_SIZE 20
...
int ObtainName( FILE *fp, char *buffer, size_t bufsize )
{
int result = 1; // assume success
int scanfResult = 0;
char fmt[FMT_SIZE];
sprintf( fmt, "%%%zus", bufsize - 1 );
scanfResult = fscanf( fp, fmt, buffer );
if ( scanfResult == EOF )
{
// hit end-of-file before reading any text
result = 0;
}
else if ( scanfResult == 0 )
{
// did not read anything from input stream
result = 0;
}
else
{
result = 1;
}
return result;
}
So what's this noise
char fmt[FMT_SIZE];
sprintf( fmt, "%%%zus", bufsize - 1 );
about? There is a very nasty security hole in fscanf() when you use the %s or %[ conversion specifiers without a maximum length specifier. The %s conversion specifier tells fscanf to read characters until it sees a whitespace character; if there are more non-whitespace characters in the stream than the buffer is sized to hold, fscanf will store those extra characters past the end of the buffer, clobbering whatever memory is following it. This is a common malware exploit. So we want to specify a maximum length for the input; for example, %20s says to read no more than 20 characters from the stream and store them to the buffer.
Unfortunately, since the buffer length is passed in as an argument, we can't write something like %20s, and fscanf doesn't give us a way to specify the length as an argument the way fprintf does. So we have to create a separate format string, which we store in fmt. If the input buffer length is 10, then the format string will be %9s, leaving room for the terminating null. If the input buffer length is 1000, then the format string will be %999s.
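If a name may contain spaces and occupies a whole line, the fgets() route mentioned above avoids building a format string altogether. A minimal sketch, assuming the same caller-supplied buffer convention; the name ObtainNameLine is hypothetical:

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Reads one line from fp into buffer (at most bufsize - 1 characters)
 * and strips the trailing newline. Returns 1 on success, 0 on end of
 * file, read error, or an unusable buffer size. */
int ObtainNameLine(FILE *fp, char *buffer, size_t bufsize)
{
    assert(fp != NULL && buffer != NULL);
    if (bufsize == 0 || bufsize > INT_MAX)
        return 0;                       /* fgets takes an int size */
    if (fgets(buffer, (int)bufsize, fp) == NULL)
        return 0;
    buffer[strcspn(buffer, "\n")] = '\0';   /* cut at the newline, if present */
    return 1;
}
```

fgets never stores more than bufsize - 1 characters plus the terminator, so the length limit is enforced without a separate format string.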
The following code expands on that in your question, and returns the string in allocated storage:
char* ObtainName(FILE *fp)
{
int temp;
int i = 1;
char *string = malloc(i);
if(NULL == string)
{
fprintf(stderr, "malloc() failed\n");
goto CLEANUP;
}
*string = '\0';
temp = fgetc(fp);
while(temp != '\n' && temp != EOF) /* also stop at end of file */
{
char *newMem;
++i;
newMem=realloc(string, i);
if(NULL==newMem)
{
fprintf(stderr, "realloc() failed.\n");
goto CLEANUP;
}
string=newMem;
string[i-2] = temp; /* the buffer holds i bytes: indexes 0 .. i-1 */
string[i-1] = '\0';
temp = fgetc(fp);
}
CLEANUP:
if(NULL != string)
printf("%s", string);
return(string);
}
Take care to 'free()' the string returned by this function, or a memory leak will occur.

Allocating an array of an unknown size

Context: What I'm trying to do is make a program that takes text as input and stores it in a character array. Then I would print each element of the array as a decimal. E.g. "Hello World" would be converted to 72, 101, etc. I would use this as a quick ASCII2DEC converter. I know there are online converters but I'm trying to make this one on my own.
Problem: how can I allocate an array whose size is unknown at compile-time and make it the exact same size as the text I enter? So when I enter "Hello World" it would dynamically make an array with the exact size required to store just "Hello World". I have searched the web but couldn't find anything that I could make use of.
I see that you're using C. You could do something like this:
#define INC_SIZE 10
char *buf = (char*) malloc(INC_SIZE),*temp;
int size = INC_SIZE,len = 0;
int c; // int rather than char, so that EOF can be detected
while ((c = getchar()) != '\n' && c != EOF) { // I assume you want to read a line of input
if (len == size) {
size += INC_SIZE;
temp = (char*) realloc(buf,size);
if (temp == NULL) {
// not enough memory probably, handle it yourself
}
buf = temp;
}
buf[len++] = c;
}
// done, note that the character array has no '\0' terminator and the length is represented by `len` variable
Typically, on environments like a PC where there are no great memory constraints, I would just dynamically allocate, (language-dependent) an array/string/whatever of, say, 64K and keep an index/pointer/whatever to the current end point plus one - ie. the next index/location to place any new data.
If you use C++, you can store the input characters in a std::string and access each character with operator[], as in the following code:
std::string input;
std::cin >> input;
I'm going to guess you mean C, as that's one of the commonest compiled languages where you would have this problem.
Variables that you declare in a function are stored on the stack. This is nice and efficient, gets cleaned up when your function exits, etc. The only problem is that the size of the stack slot for each function is fixed and cannot change while the function is running.
The second place you can allocate memory is the heap. This is a free-for-all that you can allocate and deallocate memory from at runtime. You allocate with malloc(), and when finished, you call free() on it (this is important to avoid memory leaks).
With heap allocations you must know the size at allocation time, but it's better than having it stored in fixed stack space that you cannot grow if needed.
This is a simple and stupid function to decode a string to its ASCII codes using a dynamically-allocated buffer:
char* str_to_ascii_codes(char* str)
{
size_t i;
size_t str_length = strlen(str);
char* ascii_codes = malloc(str_length*4+1);
for(i = 0; i<str_length; i++)
snprintf(ascii_codes+i*4, 5, "%03d ", str[i]);
return ascii_codes;
}
Edit: You mentioned in a comment wanting to get the buffer just right. I cut corners with the above example by making each entry in the string a known length, and not trimming the result's extra space character. This is a smarter version that fixes both of those issues:
char* str_to_ascii_codes(char* str)
{
size_t i;
int written;
size_t str_length = strlen(str), ascii_codes_length = 0;
char* ascii_codes = malloc(str_length*4+1);
for(i = 0; i<str_length; i++)
{
snprintf(ascii_codes+ascii_codes_length, 5, "%d %n", str[i], &written);
ascii_codes_length = ascii_codes_length + written;
}
/* This is intentionally one byte short, to trim the trailing space char */
ascii_codes = realloc(ascii_codes, ascii_codes_length);
/* Add new end-of-string marker */
ascii_codes[ascii_codes_length-1] = '\0';
return ascii_codes;
}
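Another way to get an exactly sized buffer is to measure first and allocate afterwards. The sketch below assumes a C99 snprintf, which returns the length that would have been written when called with a NULL buffer and size 0. The name str_to_ascii_codes_exact is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap-allocated string such as "72 105" for "Hi", sized
 * exactly, or NULL if allocation fails. The caller must free() it. */
char *str_to_ascii_codes_exact(const char *str)
{
    assert(str != NULL);
    size_t n = strlen(str);

    /* Pass 1: digits of each code plus one byte for the separating
     * space (or, for the last code, the terminating '\0'). */
    size_t total = (n == 0) ? 1 : 0;
    for (size_t i = 0; i < n; i++)
        total += (size_t)snprintf(NULL, 0, "%d", (unsigned char)str[i]) + 1;

    char *out = malloc(total);
    if (out == NULL)
        return NULL;
    out[0] = '\0';

    /* Pass 2: write the codes, with a space between them but not after the last. */
    size_t pos = 0;
    for (size_t i = 0; i < n; i++)
        pos += (size_t)snprintf(out + pos, total - pos, "%d%s",
                                (unsigned char)str[i], (i + 1 < n) ? " " : "");
    return out;
}
```

This avoids the realloc step and also handles the empty string, where trimming a trailing space would otherwise index before the start of the buffer.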

The intricacy of a string tokenization function in C

For brushing up my C, I'm writing some useful library code. When it came to reading text files, it's always useful to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous).
When I wrote this function, I was amazed at its intricacy. To tell the truth, I'm almost convinced that it contains bugs (especially memory leaks in case of an allocation error). Here's the code:
/* Given an input string and separators, returns an array of
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
**
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/
char** tokenize(const char* input, const char* sep)
{
/* strtok ruins its input string, so we'll work on a copy
*/
char* dup;
/* This is the array filled with tokens and returned
*/
char** toks = 0;
/* Current token
*/
char* cur_tok;
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 2;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
size_t i;
if (!(dup = strdup(input)))
return NULL;
if (!(toks = malloc(size * sizeof(*toks))))
goto cleanup_exit;
cur_tok = strtok(dup, sep);
/* While we have more tokens to process...
*/
while (cur_tok)
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (!newtoks)
goto cleanup_exit;
toks = newtoks;
}
/* Now the array is definitely large enough, so we just
** copy the new token into it.
*/
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
ntok++;
cur_tok = strtok(0, sep);
}
free(dup);
toks[ntok] = 0;
return toks;
cleanup_exit:
free(dup);
for (i = 0; i < ntok; ++i)
free(toks[i]);
free(toks);
return NULL;
}
And here's simple usage:
int main()
{
char line[] = "The quick brown fox jumps over the lazy dog";
char** toks = tokenize(line, " \t");
int i;
for (i = 0; toks[i]; ++i)
printf("%s\n", toks[i]);
/* Deallocate
*/
for (i = 0; toks[i]; ++i)
free(toks[i]);
free(toks);
return 0;
}
Oh, and strdup:
/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
size_t len = strlen(str) + 1;
char* dup = malloc(len);
if (dup)
memcpy(dup, str, len);
return dup;
}
A few things to note about the code of the tokenize function:
strtok has the impolite habit of writing over its input string. To save the user's data, I only call it on a duplicate of the input. The duplicate is obtained using strdup.
strdup isn't ANSI-C, however, so I had to write one
The toks array is grown dynamically with realloc, since we have no idea in advance how many tokens there will be. The initial size is 2 just for testing, in real-life code I would probably set it to a much higher value. It's also returned to the user, and the user has to deallocate it after use.
In all cases, extreme care is taken not to leak resources. For example, if realloc returns NULL, it won't run over the old pointer. The old pointer will be released and the function returns. No resources leak when tokenize returns (except in the nominal case where the array returned to the user must be deallocated after use).
A goto is used for more convenient cleanup code, according to the philosophy that goto can be good in some cases (this is a good example, IMHO).
The following function can help with simple deallocation in a single call:
/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
if (toks && *toks)
{
int i;
for (i = 0; (*toks)[i]; ++i)
free((*toks)[i]);
free(*toks);
*toks = 0;
}
}
I'd really like to discuss this code with other users of SO. What could've been done better? Would you recommend a different interface for such a tokenizer? How can the burden of deallocation be taken from the user? Are there memory leaks in the code anyway?
Thanks in advance
One thing I would recommend is to provide tokenize_free that handles all the deallocations. It's easier on the user and gives you the flexibility to change your allocation strategy in the future without breaking users of your library.
The code below fails when the first character of the string is a separator:
One additional idea is not to bother duplicating each individual token. I don't see what it adds, and it just gives you more places where the code can fail. Instead, just keep the duplicate of the full buffer you made. What I mean is change:
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
to:
toks[ntok] = cur_tok;
Drop the line free(dup) from the non-error path. Finally, this changes cleanup to:
free(toks[0]);
free(toks);
You don't need to strdup() each token; you duplicate the input string, and could let strtok() chop that up. It simplifies releasing the resources afterwards, too - you only have to release the array of pointers and the single string.
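One possible shape for this single-copy design is sketched below. The names tokenize_shared and tokenize_shared_free are hypothetical. The copy's address is kept in a hidden slot in front of the returned array, an assumption made here so that the copy can be released even when strtok() skips leading separators and the first token no longer points at the start of the copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a NULL-terminated array of tokens that all point into one
 * private copy of input, or NULL on allocation failure. */
char **tokenize_shared(const char *input, const char *sep)
{
    assert(input != NULL && sep != NULL);
    size_t len = strlen(input) + 1;
    size_t size = 4;                /* total slots, hidden slot included */
    size_t used = 1;                /* slot 0 is reserved for the copy */

    char *copy = malloc(len);
    if (copy == NULL)
        return NULL;
    memcpy(copy, input, len);

    char **slots = malloc(size * sizeof *slots);
    if (slots == NULL) {
        free(copy);
        return NULL;
    }
    slots[0] = copy;

    for (char *tok = strtok(copy, sep); tok != NULL; tok = strtok(NULL, sep)) {
        if (used + 1 >= size) {     /* always keep room for the sentinel */
            char **grown = realloc(slots, size * 2 * sizeof *slots);
            if (grown == NULL) {
                free(copy);
                free(slots);
                return NULL;
            }
            slots = grown;
            size *= 2;
        }
        slots[used++] = tok;
    }
    slots[used] = NULL;
    return slots + 1;               /* the caller sees only the tokens */
}

/* Releases the array and the shared copy with two free() calls. */
void tokenize_shared_free(char **toks)
{
    if (toks != NULL) {
        free(toks[-1]);             /* the hidden copy of the input */
        free(toks - 1);
    }
}
```

As with the original, strtok() keeps hidden state, so this sketch is not safe to call from several threads or from inside another strtok() loop.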
I agree with those who say that you need a function to release the data - unless you change the interface radically and have the user provide the array of pointers as an input parameter, and then you would probably also decide that the user is responsible for duplicating the string if it must be preserved. That leads to an interface:
int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens);
The return value would be the number of tokens found.
You have to decide what to do when there are more tokens than slots in the array. Options include:
returning an error indication (negative number, likely -1), or
the full number of tokens found but the pointers that can't be assigned aren't, or
just the number of tokens that fitted, or
one more than the number of tokens, indicating that there were more, but no information on exactly how many more.
I chose to return '-1', and it led to this code:
/*
#(#)File: $RCSfile: tokenise.c,v $
#(#)Version: $Revision: 1.9 $
#(#)Last changed: $Date: 2008/02/11 08:44:50 $
#(#)Purpose: Tokenise a string
#(#)Author: J Leffler
#(#)Copyright: (C) JLSS 1987,1989,1991,1997-98,2005,2008
#(#)Product: :PRODUCT:
*/
/*TABSTOP=4*/
/*
** 1. A token is 0 or more characters followed by a terminator or separator.
** The terminator is ASCII NUL '\0'. The separators are user-defined.
** 2. A leading separator is preceded by a zero-length token.
** A trailing separator is followed by a zero-length token.
** 3. The number of tokens found is returned.
** The list of token pointers is terminated by a NULL pointer.
** 4. The routine returns 0 if the arguments are invalid.
** It returns -1 if too many tokens were found.
*/
#include "jlss.h"
#include <string.h>
#define NO 0
#define YES 1
#define IS_SEPARATOR(c,s,n) (((c) == *(s)) || ((n) > 1 && strchr((s),(c))))
#define DIM(x) (sizeof(x)/sizeof(*(x)))
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_tokenise_c[] = "#(#)$Id: tokenise.c,v 1.9 2008/02/11 08:44:50 jleffler Exp $";
#endif /* lint */
int tokenise(
char *str, /* InOut: String to be tokenised */
char *sep, /* In: Token separators */
char **token, /* Out: Pointers to tokens */
int maxtok, /* In: Maximum number of tokens */
int nulls) /* In: Are multiple separators OK? */
{
int c;
int n_tokens;
int tokenfound;
int n_sep = strlen(sep);
if (n_sep <= 0 || maxtok <= 2)
return(0);
n_tokens = 1;
*token++ = str;
while ((c = *str++) != '\0')
{
tokenfound = NO;
while (c != '\0' && IS_SEPARATOR(c, sep, n_sep))
{
tokenfound = YES;
*(str - 1) = '\0';
if (nulls)
break;
c = *str++;
}
if (tokenfound)
{
if (++n_tokens >= maxtok - 1)
return(-1);
if (nulls)
*token++ = str;
else
*token++ = str - 1;
}
if (c == '\0')
break;
}
*token++ = 0;
return(n_tokens);
}
#ifdef TEST
#include <stdio.h> /* printf() for the test harness */
struct
{
char *sep;
int nulls;
} data[] =
{
{ "/.", 0 },
{ "/.", 1 },
{ "/", 0 },
{ "/", 1 },
{ ".", 0 },
{ ".", 1 },
{ "", 0 }
};
static char string[] = "/fred//bill.c/joe.b/";
int main(void)
{
int i;
int j;
int n;
char input[100];
char *token[20];
for (i = 0; i < DIM(data); i++)
{
strcpy(input, string);
printf("\n\nTokenising <<%s>> using <<%s>>, null %d\n",
input, data[i].sep, data[i].nulls);
n = tokenise(input, data[i].sep, token, DIM(token),
data[i].nulls);
printf("Return value = %d\n", n);
for (j = 0; j < n; j++)
printf("Token %d: <<%s>>\n", j, token[j]);
if (n > 0)
printf("Token %d: 0x%08lX\n", n, (unsigned long)token[n]);
}
return(0);
}
#endif /* TEST */
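For comparison, here is a minimal sketch of the four-argument interface proposed at the start of this answer, with the "return -1 when there are more tokens than slots" choice. It is not the code above: it assumes strtok() semantics, so a run of separators counts as one and no zero-length tokens are produced, and the decision to also return -1 for NULL arguments is mine.

```c
#include <stddef.h>
#include <string.h>

/*
** Sketch only: split 'source' in place and store up to 'max_tokens'
** pointers in the caller-supplied array 'tokens'.
** Returns the number of tokens stored, or -1 if an argument is NULL
** or if there were more tokens than slots. On overflow the first
** 'max_tokens' pointers have already been stored and remain valid.
** Assumption: strtok() semantics (adjacent separators are merged).
*/
int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens)
{
    size_t n = 0;
    char *tok;

    if (source == NULL || sep == NULL || tokens == NULL)
        return -1;

    for (tok = strtok(source, sep); tok != NULL; tok = strtok(NULL, sep))
    {
        if (n >= max_tokens)
            return -1;          /* more tokens than slots */
        tokens[n++] = tok;
    }
    return (int)n;
}
```

The caller owns both the string and the array, so there is nothing to free afterwards; a call such as tokenize(input, "/.", token, DIM(token)) on a writable copy of the test string would fill the array with pointers into input.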
I don't see anything wrong with the strtok approach of modifying a string in place. It's the caller's choice whether to operate on a duplicated string, and the semantics are well understood. Below is the same method, slightly simplified to use strtok as intended, yet still returning a handy array of char * pointers (which now simply point to the tokenized segments of the original string). It gives the same output for your original main() call.
The main advantage of this approach is that you only have to free the returned array of pointers, instead of looping through it to free every element. That loop took away a lot of the simplicity, and it is something a caller would be very unlikely to expect under any normal C convention.
I also took out the goto statements, because with the code refactored they just didn't make much sense to me. I think the danger of having a single cleanup point is that it can start to grow too unwieldy and do extra steps that are not needed to clean up issues at specific locations.
Personally, I think the main philosophical point is that you should respect what other people using the language are going to expect, especially when writing library-style functions. Even if strtok's habit of overwriting separators seems odd to you, the vast majority of C programmers are used to placing \0 in the middle of a C string to split it up or shorten it, so this will seem quite natural. Also, as noted, no one is going to expect to do anything beyond a single free() on the return value of a function. Write your code so that it works that way, because people will not read whatever documentation you offer; they will act according to the memory convention your return type implies (here char **, so a caller would expect to free just that).
char** tokenize(char* input, const char* sep)
{
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 4;
/* 'ntok' is the index of the most recently filled element of 'toks'
*/
size_t ntok = 0;
/* This is the array filled with tokens and returned
*/
char** toks = malloc(size * sizeof(*toks));
if ( toks == NULL )
return NULL;
toks[ntok] = strtok( input, sep );
/* While we have more tokens to process...
*/
do
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (newtoks == NULL)
{
free(toks);
return NULL;
}
toks = newtoks;
}
ntok++;
toks[ntok] = strtok(NULL, sep);
} while (toks[ntok]);
return toks;
}
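To make the ownership convention concrete, here is a sketch of the calling side. The tokenize() in it is a condensed copy of the function above, repeated only so the example compiles on its own; print_tokens() is a hypothetical caller that I made up, and it works on a private copy because it wants to keep its original string intact.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Condensed copy of the strtok-based tokenize() above. */
char **tokenize(char *input, const char *sep)
{
    size_t size = 4;
    size_t ntok = 0;
    char **toks = malloc(size * sizeof(*toks));
    char *tok;

    if (toks == NULL)
        return NULL;
    for (tok = strtok(input, sep); tok != NULL; tok = strtok(NULL, sep))
    {
        if (ntok + 2 > size)    /* need room for this token and the sentinel */
        {
            char **newtoks;
            size *= 2;
            newtoks = realloc(toks, size * sizeof(*toks));
            if (newtoks == NULL)
            {
                free(toks);
                return NULL;
            }
            toks = newtoks;
        }
        toks[ntok++] = tok;
    }
    toks[ntok] = NULL;          /* sentinel marks the end of the array */
    return toks;
}

/* Hypothetical caller: tokenizes a private copy so 'text' is preserved.
** Returns the number of tokens printed, or -1 on allocation failure. */
int print_tokens(const char *text, const char *sep)
{
    char *copy = malloc(strlen(text) + 1);
    char **toks;
    int n;

    if (copy == NULL)
        return -1;
    strcpy(copy, text);
    toks = tokenize(copy, sep);
    if (toks == NULL)
    {
        free(copy);
        return -1;
    }
    for (n = 0; toks[n] != NULL; n++)
        printf("Token %d: <<%s>>\n", n, toks[n]);
    free(toks);     /* one free() for the array; tokens live inside 'copy' */
    free(copy);     /* the caller releases the copy it chose to make */
    return n;
}
```

The tokens stay valid only as long as the string they point into, so the array should be released before, or together with, that string.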
Just a few things:
Using gotos is not intrinsically evil or bad. Much like the preprocessor, they are often abused, but in cases like yours, where you have to exit a function differently depending on how things went, they are appropriate.
Provide a functional means of freeing the returned array. I.e. tok_free(pointer).
Use the re-entrant version of strtok() initially, i.e. strtok_r(). It would not be cumbersome for someone to pass an additional argument (even NULL if not needed) for that.
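The three points above can be combined in one rough sketch. The name tokenize_copy() and its behaviour (leaving the input untouched and returning heap copies of the tokens) are my assumptions, not the code from the question; tok_free() is the release function suggested above. strtok_r() is POSIX rather than ISO C, hence the feature-test macro.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* asks for the strtok_r() declaration */
#endif
#include <stdlib.h>
#include <string.h>

/* Release an array returned by tokenize_copy(); NULL is accepted. */
void tok_free(char **toks)
{
    size_t i;

    if (toks == NULL)
        return;
    for (i = 0; toks[i] != NULL; i++)
        free(toks[i]);
    free(toks);
}

/*
** Hypothetical variant: does not modify 'input'; returns a
** NULL-terminated array of heap-allocated token copies, or NULL on
** allocation failure. The result must be released with tok_free().
*/
char **tokenize_copy(const char *input, const char *sep)
{
    size_t size = 4;
    size_t ntok = 0;
    char *save = NULL;              /* strtok_r() keeps its state here */
    char *tok;
    char *work = malloc(strlen(input) + 1);
    char **toks = malloc(size * sizeof(*toks));

    if (toks != NULL)
        toks[0] = NULL;             /* array is terminated from the start */
    if (work == NULL || toks == NULL)
        goto fail;
    strcpy(work, input);

    for (tok = strtok_r(work, sep, &save); tok != NULL;
         tok = strtok_r(NULL, sep, &save))
    {
        if (ntok + 2 > size)        /* room for this token and the sentinel */
        {
            char **newtoks = realloc(toks, 2 * size * sizeof(*toks));
            if (newtoks == NULL)
                goto fail;
            toks = newtoks;
            size *= 2;
        }
        toks[ntok] = malloc(strlen(tok) + 1);
        if (toks[ntok] == NULL)
            goto fail;              /* slot is NULL, so the array is terminated */
        strcpy(toks[ntok], tok);
        toks[++ntok] = NULL;
    }
    free(work);
    return toks;

fail:                               /* single exit point for every error */
    free(work);
    tok_free(toks);
    return NULL;
}
```

Because every failure path ends at the same label and the array is kept NULL-terminated throughout, the cleanup code does not need to know how far the function got before it failed.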
There is a great tool for detecting memory leaks, called Valgrind.
http://valgrind.org/
If you want to find memory leaks, one possibility is to run it with valgrind.

Resources