Is there a straightforward way to copy argv in C?

When writing small UNIX utilities in C I sometimes want to take a copy of argv provided to main() so that I can tweak the parameters before calling exec(). The following is based loosely on an implementation from BSD's xargs(1):
void run_script(char *argv[]) {
    char **tmp, **new_argv;
    int argc, i;
    for (argc=0; argv[argc] != 0; argc++);
    new_argv = malloc((argc + 1) * sizeof(char *));
    for (tmp=new_argv; *argv != 0; tmp++) {
        *tmp = strdup(*argv++);
    }
    *tmp = NULL; /* execvp(3) expects a NULL-terminated vector */
    /* call execvp(3) using new_argv */
    for (i=0; i<argc; i++)
        free(new_argv[i]);
    free(new_argv);
}
This feels more complex than it needs to be. Is there a better way to write this in C? (Perhaps filling in a single buffer that was allocated using _POSIX_ARG_MAX.)

My C is a little rusty, however....
On the assumption that argv[] is going to be available for the duration of your script, you don't need to strdup strings if they are not going to be changed. After you allocate the memory (remembering to add more to argc if you want to add extra parameters), you should just be able to memcpy. Your new_argv will point to the same strings as your original argv, so you will need to strdup before you change it (or not, if you want to set it to some other string).
I think you need to take a close look at your loop, for (m=0, tmp...: first, I don't think m is used anywhere else; second, you test *argv != 0 but never change *argv in the loop, so it will run forever or not at all; and third, at the start of this loop none of the new_argv entries have been set (malloc does not initialize memory), so *tmp = strdup(*tmp); tries to duplicate whatever those indeterminate pointers refer to, which has nothing to do with argv.
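A minimal sketch of that shallow copy, assuming you want to append one extra argument. The name copy_argv_shallow and its extra parameter are made up for illustration. The strings stay shared with the original argv, so strdup() any entry before you modify it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a new NULL-terminated vector holding the
 * original argv pointers followed by one extra argument. The strings
 * themselves are shared with argv, not duplicated. */
char **copy_argv_shallow(int argc, char *argv[], char *extra)
{
    /* argc originals + 1 extra + 1 terminating NULL */
    char **new_argv = malloc((argc + 2) * sizeof *new_argv);
    if (new_argv == NULL)
        return NULL;
    memcpy(new_argv, argv, argc * sizeof *new_argv);
    new_argv[argc] = extra;
    new_argv[argc + 1] = NULL;
    return new_argv;   /* release with a single free(new_argv) */
}
```

Because only the pointer array is allocated, cleanup is one free() call and there is no per-string loop.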

I came up with another way of duplicating argv. The following example isn't shorter, but I think it is easier to read:
void run_script(char *argv[]) {
int i;
char **new_argv;
char *p, *arg_buf;
int argc;
for (argc=0; argv[argc]; argc++);
arg_buf = malloc(ARG_MAX);
new_argv = calloc(argc+1, sizeof(char *));
for (i=0, p=arg_buf; i<argc; i++) {
new_argv[i] = p; /* point this slot at its copy inside arg_buf */
p += strlcpy(p, argv[i], ARG_MAX - (p - arg_buf));
p++;
}
/* call execvp(3) using new_argv */
free(arg_buf);
free(new_argv);
}
On Linux ARG_MAX should be replaced with sysconf(_SC_ARG_MAX)
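If you would rather not reserve an ARG_MAX-sized buffer at all, one alternative is to measure the strings first and allocate exactly what is needed. This is a sketch of that different technique, not the code above. The name copy_argv_packed is hypothetical, and the pointer array and the string data share a single allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies every string of a NULL-terminated argv into
 * one block that holds the pointer array followed by the string data.
 * A single free() on the result releases everything. */
char **copy_argv_packed(char *argv[])
{
    size_t argc = 0, bytes = 0;
    for (; argv[argc] != NULL; argc++)
        bytes += strlen(argv[argc]) + 1;

    char **new_argv = malloc((argc + 1) * sizeof *new_argv + bytes);
    if (new_argv == NULL)
        return NULL;

    char *p = (char *)(new_argv + argc + 1);   /* string area starts here */
    for (size_t i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]) + 1;
        memcpy(p, argv[i], len);
        new_argv[i] = p;
        p += len;
    }
    new_argv[argc] = NULL;
    return new_argv;
}
```

The copies can be edited in place, but they cannot grow. An argument that needs to become longer still has to be pointed at separately allocated storage.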

Related

Copying strings from extern char environ in C

I have a question pertaining to the extern char **environ. I'm trying to make a C program that counts the size of the environ list, copies it to an array of strings (array of array of chars), and then sorts it alphabetically with a bubble sort. It will print in name=value or value=name order depending on the format value.
I tried using strncpy to get the strings from environ to my new array, but the string values come out empty. I suspect I'm trying to use environ in a way I can't, so I'm looking for help. I've tried to look online for help, but this particular program is very limited. I cannot use system(), yet the only help I've found online tells me to make a program to make this system call. (This does not help).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
extern char **environ;
int main(int argc, char *argv[])
{
char **env = environ;
int i = 0;
int j = 0;
printf("Hello world!\n");
int listSZ = 0;
char temp[1024];
while(env[listSZ])
{
listSZ++;
}
printf("DEBUG: LIST SIZE = %d\n", listSZ);
char **list = malloc(listSZ * sizeof(char**));
char **sorted = malloc(listSZ * sizeof(char**));
for(i = 0; i < listSZ; i++)
{
list[i] = malloc(sizeof(env[i]) * sizeof(char)); // set the 2D Array strings to size 80, for good measure
sorted[i] = malloc(sizeof(env[i]) * sizeof(char));
}
while(env[i])
{
strncpy(list[i], env[i], sizeof(env[i]));
i++;
} // copy is empty???
for(i = 0; i < listSZ - 1; i++)
{
for(j = 0; j < sizeof(list[i]); j++)
{
if(list[i][j] > list[i+1][j])
{
strcpy(temp, list[i]);
strcpy(list[i], list[i+1]);
strcpy(list[i+1], temp);
j = sizeof(list[i]); // end loop, we resolved this specific entry
}
// else continue
}
}
This is my code, help is greatly appreciated. Why is this such a hard to find topic? Is it the lack of necessity?
EDIT: Pasted wrong code, this was a separate .c file on the same topic, but I started fresh on another file.
In a unix environment, the environment is a third parameter to main.
Try this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char *argv[], char **envp)
{
while (*envp) {
printf("%s\n", *envp);
envp++;
}
}
There are multiple problems with your code, including:
Allocating the 'wrong' size for list and sorted: you multiply by sizeof(char **), but should be multiplying by sizeof(char *) because you're allocating an array of char *. This bug won't actually hurt you this time, since both pointer types have the same size on typical systems. Using sizeof(*list) avoids the problem.
Allocating the wrong size for the elements in list and sorted. You need to use strlen(env[i]) + 1 for the size, remembering to allow for the null that terminates the string.
You don't check the memory allocations.
Your string copying loop is using strncpy() and shouldn't (actually, you should seldom use strncpy()), not least because it is only copying 4 or 8 bytes of each environment variable (depending on whether you're on a 32-bit or 64-bit system), and it is not ensuring that they're null-terminated strings (just one of the many reasons for not using strncpy()). As written, the loop also never runs: i still equals listSZ after the allocation loop, so env[i] is already NULL.
Your outer loop of your 'sorting' code is OK; your inner loop is 100% bogus because you should be using the length of one or the other string, not the size of the pointer, and your comparisons are on single characters, but you're then using strcpy() where you simply need to move pointers around.
You allocate but don't use sorted.
You don't print the sorted environment to demonstrate that it is sorted.
Your code is missing the final }.
Here is some simple code that uses the standard C library qsort() function to do the sorting, and simulates POSIX strdup()
under the name dup_str() — you could use strdup() if you have POSIX available to you.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern char **environ;
/* Can also be spelled strdup() and provided by the system */
static char *dup_str(const char *str)
{
size_t len = strlen(str) + 1;
char *dup = malloc(len);
if (dup != NULL)
memmove(dup, str, len);
return dup;
}
static int cmp_str(const void *v1, const void *v2)
{
const char *s1 = *(const char **)v1;
const char *s2 = *(const char **)v2;
return strcmp(s1, s2);
}
int main(void)
{
char **env = environ;
int listSZ;
for (listSZ = 0; env[listSZ] != NULL; listSZ++)
;
printf("DEBUG: Number of environment variables = %d\n", listSZ);
char **list = malloc(listSZ * sizeof(*list));
if (list == NULL)
{
fprintf(stderr, "Memory allocation failed!\n");
exit(EXIT_FAILURE);
}
for (int i = 0; i < listSZ; i++)
{
if ((list[i] = dup_str(env[i])) == NULL)
{
fprintf(stderr, "Memory allocation failed!\n");
exit(EXIT_FAILURE);
}
}
qsort(list, listSZ, sizeof(list[0]), cmp_str);
for (int i = 0; i < listSZ; i++)
printf("%2d: %s\n", i, list[i]);
return 0;
}
Other people pointed out that you can get at the environment via a third argument to main(), using the prototype int main(int argc, char **argv, char **envp). Note that Microsoft explicitly supports this. They're correct, but you can also get at the environment via environ, even in functions other than main(). The variable environ is unique amongst the global variables defined by POSIX in not being declared in any header file, so you must write the declaration yourself.
Note that the memory allocation is error checked and the error reported on standard error, not standard output.
Clearly, if you like writing and debugging sort algorithms, you can avoid using qsort(). Note that string comparisons need to be done using strcmp(), but you can't use strcmp() directly with qsort() when you're sorting an array of pointers because the argument types are wrong.
Part of the output for me was:
DEBUG: Number of environment variables = 51
0: Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.tQHOVHUgys/Render
1: BASH_ENV=/Users/jleffler/.bashrc
2: CDPATH=:/Users/jleffler:/Users/jleffler/src:/Users/jleffler/src/perl:/Users/jleffler/src/sqltools:/Users/jleffler/lib:/Users/jleffler/doc:/Users/jleffler/work:/Users/jleffler/soq/src
3: CLICOLOR=1
4: DBDATE=Y4MD-
…
47: VISUAL=vim
48: XPC_FLAGS=0x0
49: XPC_SERVICE_NAME=0
50: _=./pe17
If you want to sort the values instead of the names, you have to do some harder work. You'd need to define what output you wish to see. There are multiple ways of handling that sort.
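As one possible definition of sorting by value, here is a sketch of a comparator that orders entries by the text after the first '=' and falls back to the whole entry on ties. The names value_part and cmp_by_value are invented for this example, and treating an entry without '=' as having an empty value is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the text after the first '=', or the empty string if there is none. */
static const char *value_part(const char *entry)
{
    const char *eq = strchr(entry, '=');
    return eq != NULL ? eq + 1 : "";
}

/* qsort() comparator: orders "name=value" strings by value, then by whole entry. */
static int cmp_by_value(const void *v1, const void *v2)
{
    const char *s1 = *(const char *const *)v1;
    const char *s2 = *(const char *const *)v2;
    int rc = strcmp(value_part(s1), value_part(s2));
    return rc != 0 ? rc : strcmp(s1, s2);
}
```

With the program above, it would be passed in place of cmp_str: qsort(list, listSZ, sizeof(list[0]), cmp_by_value);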
To get the environment variables, you need to declare main like this:
int main(int argc, char **argv, char **env);
The third parameter is the NULL-terminated list of environment variables. See:
#include <stdio.h>
int main(int argc, char **argv, char **env)
{
for(size_t i = 0; env[i]; ++i)
puts(env[i]);
return 0;
}
The output of this is:
LD_LIBRARY_PATH=/home/shaoran/opt/node-v6.9.4-linux-x64/lib:
LS_COLORS=rs=0:di=01;34:ln=01;36:m
...
Note also that sizeof(environ[i]) in your code does not get you the length of
the string, it gets you the size of a pointer, so
strncpy(list[i], environ[i], sizeof(environ[i]));
is wrong. Also the whole point of strncpy is to limit based on the destination,
not on the source, otherwise if the source is larger than the destination, you
will still overflow the buffer. The correct call would be
strncpy(list[i], environ[i], 80);
list[i][79] = 0;
Bear in mind that strncpy will not write the '\0'-terminating byte if the
source does not fit in the destination, so you have to make sure to terminate the
string. Also note that 79 characters might be too short for storing env variables. For example, my LS_COLORS variable
is huge, at least 1500 characters long. You might want to do your list[i] = malloc calls based on strlen(environ[i])+1.
Another thing: your swapping
strcpy(temp, list[i]);
strcpy(list[i], list[i+1]);
strcpy(list[i+1], temp);
j = sizeof(list[i]);
works only if all list[i] point to memory of the same size. Since the list[i] are pointers, the cheaper way of swapping would be by
swapping the pointers instead:
char *tmp = list[i];
list[i] = list[i+1];
list[i+1] = tmp;
This is more efficient, is an O(1) operation and you don't have to worry if the
memory spaces are not of the same size.
What I don't get is, what do you intend with j = sizeof(list[i])? Not only
does sizeof(list[i]) return the size of a pointer (which will be constant
for all list[i]), but why are you messing with the running variable j inside the
block? If you want to leave the loop, then use break. And you are looking for
strlen(list[i]): this will give you the length of the string.
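Putting those points together, a bubble sort over the string pointers might look like the sketch below. The function name bubble_sort_strings is hypothetical. It compares whole strings with strcmp() and moves only pointers, so the differing string lengths never matter.

```c
#include <stddef.h>
#include <string.h>

/* Bubble sort over an array of n string pointers, ordering by strcmp().
 * Only the pointers move; the strings themselves are never copied. */
void bubble_sort_strings(char **list, size_t n)
{
    for (size_t pass = 0; pass + 1 < n; pass++) {
        int swapped = 0;
        for (size_t i = 0; i + 1 < n - pass; i++) {
            if (strcmp(list[i], list[i + 1]) > 0) {
                char *tmp = list[i];
                list[i] = list[i + 1];
                list[i + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped)
            break;      /* already sorted, stop early */
    }
}
```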

Memory Management Command Line Arguments in C

I am writing a simple program that takes the command line arguments and stores them into a char **. I am trying to learn more about memory management but cannot get past this simple stumbling block. My program is supposed to copy the command line arguments into a dynamically allocated char **. However, the first position in my array is always corrupted. Below is the code and what it prints:
if (strcmp(argv[1], "test") ==0)
{
test();
}
else
{
char ** file_names = malloc(10);
for(int i =0; i < argc-1; ++i)
{
file_names[i] = malloc(10);
strcpy(file_names[i], argv[i+1]);
printf("%s, %s\n", argv[i+1], file_names[i]);
}
printf("____________\n");
for(int i =0; i < argc-1; ++i)
{
printf("%s\n", file_names[i]);
}
}
and the outcome is:
what, what
test, test
again, again
wow, wow
____________
pK#??
test
again
wow
can someone please explain why this is happening? Thanks
This:
char ** file_names = malloc(10);
is a bug. It attempts to allocate 10 bytes, which has nothing at all to do with how many bytes you need. Under-allocating and then overwriting gives you undefined behavior.
It should be something like:
char **file_names = malloc(argc * sizeof *file_names);
This computes the size of the allocation by multiplying the number of arguments (argc, if you don't want to store argv[0] then this should be (argc - 1), of course) by the size of a character pointer, which is expressed as sizeof *file_names. Since file_names is of type char * *, the type of *file_names is char *, which is what you want. This is a general pattern, it can be applied very often and it lets you stop repeating the type name. It can protect you from errors.
For instance compare:
double *floats = malloc(1024 * sizeof(float)); /* BAD CODE */
and:
double *floats = malloc(1024 * sizeof *floats);
If you imagine that originally it was float *floats (as the naming suggests) then the first variant contains another under-allocation bug, while the second "survived" the change of types without error.
Then you need to check that it succeeded, before assuming it did.
You want to allocate the right amount of memory for file_names, probably more like:
char ** file_names = malloc(sizeof(char*) * (argc - 1));
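The inner malloc(10) has the same problem for any argument longer than nine characters. A sketch that sizes both allocations from the data and checks each one could look like this. The helper name copy_file_names and the NULL terminator on the array are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicates argv[1..argc-1] into a freshly allocated,
 * NULL-terminated array. Returns NULL if any allocation fails. */
char **copy_file_names(int argc, char **argv)
{
    int count = argc > 1 ? argc - 1 : 0;
    char **file_names = malloc((count + 1) * sizeof *file_names);
    if (file_names == NULL)
        return NULL;

    for (int i = 0; i < count; i++) {
        size_t len = strlen(argv[i + 1]) + 1;   /* +1 for the terminator */
        file_names[i] = malloc(len);
        if (file_names[i] == NULL) {
            while (i-- > 0)                     /* undo the earlier copies */
                free(file_names[i]);
            free(file_names);
            return NULL;
        }
        memcpy(file_names[i], argv[i + 1], len);
    }
    file_names[count] = NULL;
    return file_names;
}
```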

Can't copy characters from pointer to another pointer(both with memory allocated)

I have a program that accepts a string from the command line via argv. I copy argv[1] using strcpy into structptr->words, a member of a struct for which memory has been allocated. I then copy character by character from that memory into another pointer called words, which also points to allocated memory. After I've copied one character I print that element [c] to make sure that it has been copied correctly (which it has). I then finish copying all of the characters and return the result to a char pointer, but for some reason it is blank/null. After copying each character I checked whether the previous elements were still correct, but they don't show up anymore ([c-2], [c-1], [c]). Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct StructHolder {
char *words;
};
typedef struct StructHolder Holder;
char *GetCharacters(Holder *ptr){
int i=0;
char *words=malloc(sizeof(char));
for(i;i<strlen(ptr->words);i++){
words[i]=ptr->words[i];
words=realloc(words,sizeof(char)+i);
}
words[strlen(ptr->words)]='\0';
return words;
}
int main(int argc, char **argv){
Holder *structptr=malloc(sizeof(Holder));
structptr->words=malloc(strlen(argv[1]));
strcpy(structptr->words, argv[1]);
char *charptr;
charptr=(GetCharacters(structptr));
printf("%s\n", charptr);
return 0;
}
At first I thought this was the problem:
char *words=malloc(sizeof(char)) is allocating 1 byte (sizeof 1 char). You probably meant char *words = malloc(strlen(ptr->words)+1); you probably also want to null-check ptr and its member just to be safe.
Then I saw the realloc. Your realloc is always 1 char short. When i = 0 you allocate 1 byte then hit the loop, increment i and put a char 1 past the end of the realloced array (at index 1)
Also, the buffer your strcpy in main writes to is one byte too small: malloc(strlen(argv[1])) leaves no room for the terminating null.
In these two lines,
structptr->words=malloc(strlen(argv[1]));
strcpy(structptr->words, argv[1]);
you need to add one to the size to hold the nul-terminator: strlen(argv[1]) should be strlen(argv[1])+1.
I think the same thing is happening in the loop, and it should be larger by 1. And sizeof(char) is always 1 by definition, so:
...
words=realloc(words,i+2);
}
words=realloc(words,i+2); // one more time to make room for the '\0'
words[strlen(ptr->words)]='\0';
FYI: Your description talks about structptr but your code uses struct StructHolder and Holder.
This code is a disaster:
char *GetCharacters(Holder *ptr){
int i=0;
char *words=malloc(sizeof(char));
for(i;i<strlen(ptr->words);i++){
words[i]=ptr->words[i];
words=realloc(words,sizeof(char)+i);
}
words[strlen(ptr->words)]='\0';
return words;
}
It should be:
char *GetCharacters(const Holder *ptr)
{
char *words = malloc(strlen(ptr->words) + 1);
if (words != 0)
strcpy(words, ptr->words);
return words;
}
Or even:
char *GetCharacters(const Holder *ptr)
{
return strdup(ptr->words);
}
And all of those accept that passing the structure type makes sense; there's no obvious reason why you don't just pass the const char *words instead.
Dissecting the 'disaster' (and ignoring the argument type):
char *GetCharacters(Holder *ptr){
int i=0;
OK so far, though you're not going to change the structure so it could be a const Holder *ptr argument.
char *words=malloc(sizeof(char));
Allocating one byte is expensive — more costly than calling strlen(). This is not a good start, though of itself, it is not wrong. You do not, however, check that the memory allocation succeeded. That is a mistake.
for(i;i<strlen(ptr->words);i++){
The i; first term is plain weird. You could write for (i = 0; ... (and possibly omit the initializer in the definition of i), or you could write for (int i = 0; ....
Using strlen() repeatedly in a loop like that is bad news too. You should be using:
int len = strlen(ptr->words);
for (i = 0; i < len; i++)
Next:
words[i]=ptr->words[i];
This assignment is not a problem.
words=realloc(words,sizeof(char)+i);
This realloc() assignment is a problem. If you get back a null pointer, you've lost the only reference to the previously allocated memory. You need, therefore, to save the return value separately, test it, and only assign if successful:
void *space = realloc(words, i + 2); // When i = 0, allocate 2 bytes.
if (space == 0)
break;
words = space;
This would be better/safer. It isn't completely clean; it might be better to replace break; with { free(words); return 0; } to do an early exit. But this whole business of allocating one byte at a time is not the right way to do it. You should work out how much space to allocate, then allocate it all at once.
}
words[strlen(ptr->words)]='\0';
You could avoid recalculating the length by using i instead of strlen(ptr->words). This would have the side benefit of being correct if the if (space == 0) break; was executed.
return words;
}
The rest of this function is OK.
I haven't spent time analyzing main(); it is not, however, problem-free.

building a string out of a variable amount of arguments

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
int main(int argc, char * argv[])
{
char *arr[] = { "ab", "cd", "ef" };
char **ptr, **p, *str;
int num = 3;
int size = 0;
ptr = calloc(num, sizeof *ptr); /* not 4: pointers are 8 bytes on 64-bit systems */
p = ptr;
for (; num > 0; num--)
size += strlen(*(p++) = arr[num - 1]);
str = calloc(1, ++size);
sprintf(str, "%s%s%s", ptr[0], ptr[1], ptr[2]);
printf("%s\n", str);
return 0;
}
output: "efcdab" as expected.
now, this is all fine and suitable if the argument count to sprintf is predetermined and known. what i'm trying to achieve, however, is an elegant way of building a string if the argument count is variable (ptr[any]).
first problem: 2nd argument that is required to be passed to sprintf is const char *format.
second: the 3rd argument is the actual amount of passed on arguments in order to build the string based on the provided format.
how can i achieve something of the following:
sprintf(str, "...", ...)
basically, what if the function receives 4 (or more) char pointers out of which i want to build a whole string (currently, within the code provided above, there's only 3). that would mean, that the 2nd argument must be (at least) in the form of "%s%s%s%s", followed by an argument list of ptr[0], ptr[1], ptr[2], ptr[3].
how can i make such a 'combined' call to sprintf (or vsprintf) in the first place? things would be easier if i could just provide a whole pointer array (**ptr) as the 3rd argument instead, but that does not seem to be feasible, at least not in a way that sprintf would understand, since it would need some special form of format.
ideas / suggestions?
karlphillip's suggestion of strcat does seem to be the solution here. Or rather, you'd more likely want to use something like strncat (though if you're working with a C library that supports it, I'd recommend strlcat, which, in my opinion, is much better than strncat).
So, rather than sprintf(str, "%s%s%s", ptr[0], ptr[1], ptr[2]);, you could do something like this:
int i;
for (i = 0; i < any; i++)
strncat(str, arr[i], size - strlen(str) - 1);
(Or strlcat(str, arr[i], size);; the nice thing about strlcat is that its return value will indicate how many bytes are needed for reallocation if the destination buffer is too small, but it's not a standard C function and a lot of systems don't support it.)
There's no other way to do this in C without manipulating buffers.
You could, however, switch to C++ and use the fabulous std::string to make your life easier.
Your first problem is handled by: const char * is for the function, not you. Put together your own string -- that signature just means that the function won't change it.
Your second problem is handled by: pass in your own va_list. How do you get it? Make your own varargs function:
char *assemble_strings(int count, ...)
{
va_list data_list;
va_list len_list;
int size;
char *arg;
char *formatstr;
char *str;
int i;
va_start(len_list, count);
for (i = 0, size = 0; i < count; i++)
{
arg = va_arg(len_list, char *);
size += strlen(arg);
}
va_end(len_list);
formatstr = malloc(2*count + 1);
formatstr[2*count] = 0;
for (i = 0; i < count; i++)
{
formatstr[2*i] = '%';
formatstr[2*i+1] = 's';
}
str = malloc(size + 1);
va_start(data_list, count);
vsprintf(str, formatstr, data_list);
va_end(data_list);
free(formatstr);
return(str);
}
You'll need some way to terminate the varargs, of course, and it's much easier to just pass it to vsprintf if the string list is entirely within the varargs -- since standard C requires at least one regular argument.
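One common way to terminate the list is a NULL sentinel instead of a count. The sketch below is a variant along those lines, not the function above. The name concat_strings is made up, and it copies with memcpy() rather than building a format string for vsprintf().

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: concatenates its arguments up to a NULL sentinel.
 * The first string is a named parameter because standard C requires one. */
char *concat_strings(const char *first, ...)
{
    va_list ap;
    size_t size = 0;
    const char *s;

    va_start(ap, first);                       /* pass 1: total length */
    for (s = first; s != NULL; s = va_arg(ap, char *))
        size += strlen(s);
    va_end(ap);

    char *str = malloc(size + 1);
    if (str == NULL)
        return NULL;

    char *p = str;
    va_start(ap, first);                       /* pass 2: copy */
    for (s = first; s != NULL; s = va_arg(ap, char *)) {
        size_t len = strlen(s);
        memcpy(p, s, len);
        p += len;
    }
    va_end(ap);
    *p = '\0';
    return str;
}
```

The caller has to write the sentinel as (char *)NULL, because a bare NULL passed through ... is not guaranteed to have pointer type.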
The loop I would use for the final copy into str would be something like:
for(i=0, p=str; i < num; i++)
p += sprintf(p, "%s", ptr[i]);
or
for(i=0, p=str; i < num; i++)
p += strlen(strcpy(p, ptr[i]));
rather than trying to deal with a variable number of arguments in a single call to sprintf.
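Wrapped up as a function that also works out the size, that loop could look like the following sketch. The name join_strings is hypothetical, and it uses memcpy() with the measured length in place of strcpy().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: concatenates the num strings in ptr into one
 * newly allocated string. Returns NULL if malloc() fails. */
char *join_strings(char **ptr, size_t num)
{
    size_t size = 1;                    /* room for the terminator */
    for (size_t i = 0; i < num; i++)
        size += strlen(ptr[i]);

    char *str = malloc(size);
    if (str == NULL)
        return NULL;

    char *p = str;
    for (size_t i = 0; i < num; i++) {
        size_t len = strlen(ptr[i]);
        memcpy(p, ptr[i], len);
        p += len;
    }
    *p = '\0';
    return str;
}
```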

The intricacy of a string tokenization function in C

For brushing up my C, I'm writing some useful library code. When it came to reading text files, it's always useful to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous).
When I wrote this function, I was amazed at its intricacy. To tell the truth, I'm almost convinced that it contains bugs (especially memory leaks in case of an allocation error). Here's the code:
/* Given an input string and separators, returns an array of
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
**
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/
char** tokenize(const char* input, const char* sep)
{
/* strtok ruins its input string, so we'll work on a copy
*/
char* dup;
/* This is the array filled with tokens and returned
*/
char** toks = 0;
/* Current token
*/
char* cur_tok;
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 2;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
size_t i;
if (!(dup = strdup(input)))
return NULL;
if (!(toks = malloc(size * sizeof(*toks))))
goto cleanup_exit;
cur_tok = strtok(dup, sep);
/* While we have more tokens to process...
*/
while (cur_tok)
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (!newtoks)
goto cleanup_exit;
toks = newtoks;
}
/* Now the array is definitely large enough, so we just
** copy the new token into it.
*/
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
ntok++;
cur_tok = strtok(0, sep);
}
free(dup);
toks[ntok] = 0;
return toks;
cleanup_exit:
free(dup);
for (i = 0; i < ntok; ++i)
free(toks[i]);
free(toks);
return NULL;
}
And here's simple usage:
int main()
{
char line[] = "The quick brown fox jumps over the lazy dog";
char** toks = tokenize(line, " \t");
int i;
for (i = 0; toks[i]; ++i)
printf("%s\n", toks[i]);
/* Deallocate
*/
for (i = 0; toks[i]; ++i)
free(toks[i]);
free(toks);
return 0;
}
Oh, and strdup:
/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
size_t len = strlen(str) + 1;
char* dup = malloc(len);
if (dup)
memcpy(dup, str, len);
return dup;
}
A few things to note about the code of the tokenize function:
strtok has the impolite habit of writing over its input string. To save the user's data, I only call it on a duplicate of the input. The duplicate is obtained using strdup.
strdup isn't ANSI-C, however, so I had to write one
The toks array is grown dynamically with realloc, since we have no idea in advance how many tokens there will be. The initial size is 2 just for testing, in real-life code I would probably set it to a much higher value. It's also returned to the user, and the user has to deallocate it after use.
In all cases, extreme care is taken not to leak resources. For example, if realloc returns NULL, it won't run over the old pointer. The old pointer will be released and the function returns. No resources leak when tokenize returns (except in the nominal case where the array returned to the user must be deallocated after use).
A goto is used for more convenient cleanup code, according to the philosophy that goto can be good in some cases (this is a good example, IMHO).
The following function can help with simple deallocation in a single call:
/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
if (toks && *toks)
{
int i;
for (i = 0; (*toks)[i]; ++i)
free((*toks)[i]);
free(*toks);
*toks = 0;
}
}
I'd really like to discuss this code with other users of SO. What could've been done better? Would you recommend a different interface to such a tokenizer? How can the burden of deallocation be taken from the user? Are there memory leaks in the code anyway?
Thanks in advance
One thing I would recommend is to provide tokenize_free that handles all the deallocations. It's easier on the user and gives you the flexibility to change your allocation strategy in the future without breaking users of your library.
The code below fails when the first character of the string is a separator:
One additional idea is not to bother duplicating each individual token. I don't see what it adds, and it just gives you more places where the code can fail. Instead, just keep the duplicate of the full buffer you made. What I mean is change:
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
to:
toks[ntok] = cur_tok;
Drop the line free(dup) from the non-error path. Finally, this changes cleanup to:
free(toks[0]);
free(toks);
You don't need to strdup() each token; you duplicate the input string, and could let strtok() chop that up. It simplifies releasing the resources afterwards, too - you only have to release the array of pointers and the single string.
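A sketch of that single-copy approach follows. The name tokenize_shared and the buf_out parameter are assumptions for illustration. The buffer is handed back separately because strtok() skips leading separators, so the first token does not always start at the beginning of the copy, and calling free() on the first token would then be wrong.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: splits a private copy of input with strtok().
 * On success returns a NULL-terminated token array and stores the copy in
 * *buf_out; the caller releases both with free(toks) and free(*buf_out). */
char **tokenize_shared(const char *input, const char *sep, char **buf_out)
{
    size_t len = strlen(input) + 1;
    size_t size = 4, ntok = 0;
    char *buf = malloc(len);
    char **toks = malloc(size * sizeof *toks);
    if (buf == NULL || toks == NULL)
        goto fail;
    memcpy(buf, input, len);

    for (char *tok = strtok(buf, sep); tok != NULL; tok = strtok(NULL, sep)) {
        if (ntok + 2 > size) {          /* keep one slot for the sentinel */
            char **grown = realloc(toks, 2 * size * sizeof *toks);
            if (grown == NULL)
                goto fail;
            toks = grown;
            size *= 2;
        }
        toks[ntok++] = tok;             /* points into buf, no strdup() */
    }
    toks[ntok] = NULL;
    *buf_out = buf;
    return toks;

fail:
    free(buf);
    free(toks);
    return NULL;
}
```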
I agree with those who say that you need a function to release the data - unless you change the interface radically and have the user provide the array of pointers as an input parameter, and then you would probably also decide that the user is responsible for duplicating the string if it must be preserved. That leads to an interface:
int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens);
The return value would be the number of tokens found.
You have to decide what to do when there are more tokens than slots in the array. Options include:
returning an error indication (negative number, likely -1), or
the full number of tokens found but the pointers that can't be assigned aren't, or
just the number of tokens that fitted, or
one more than the number of tokens, indicating that there were more, but no information on exactly how many more.
I chose to return '-1', and it led to this code:
/*
#(#)File: $RCSfile: tokenise.c,v $
#(#)Version: $Revision: 1.9 $
#(#)Last changed: $Date: 2008/02/11 08:44:50 $
#(#)Purpose: Tokenise a string
#(#)Author: J Leffler
#(#)Copyright: (C) JLSS 1987,1989,1991,1997-98,2005,2008
#(#)Product: :PRODUCT:
*/
/*TABSTOP=4*/
/*
** 1. A token is 0 or more characters followed by a terminator or separator.
** The terminator is ASCII NUL '\0'. The separators are user-defined.
** 2. A leading separator is preceded by a zero-length token.
** A trailing separator is followed by a zero-length token.
** 3. The number of tokens found is returned.
** The list of token pointers is terminated by a NULL pointer.
** 4. The routine returns 0 if the arguments are invalid.
** It returns -1 if too many tokens were found.
*/
#include "jlss.h"
#include <string.h>
#define NO 0
#define YES 1
#define IS_SEPARATOR(c,s,n) (((c) == *(s)) || ((n) > 1 && strchr((s),(c))))
#define DIM(x) (sizeof(x)/sizeof(*(x)))
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_tokenise_c[] = "#(#)$Id: tokenise.c,v 1.9 2008/02/11 08:44:50 jleffler Exp $";
#endif /* lint */
int tokenise(
char *str, /* InOut: String to be tokenised */
char *sep, /* In: Token separators */
char **token, /* Out: Pointers to tokens */
int maxtok, /* In: Maximum number of tokens */
int nulls) /* In: Are multiple separators OK? */
{
int c;
int n_tokens;
int tokenfound;
int n_sep = strlen(sep);
if (n_sep <= 0 || maxtok <= 2)
return(0);
n_tokens = 1;
*token++ = str;
while ((c = *str++) != '\0')
{
tokenfound = NO;
while (c != '\0' && IS_SEPARATOR(c, sep, n_sep))
{
tokenfound = YES;
*(str - 1) = '\0';
if (nulls)
break;
c = *str++;
}
if (tokenfound)
{
if (++n_tokens >= maxtok - 1)
return(-1);
if (nulls)
*token++ = str;
else
*token++ = str - 1;
}
if (c == '\0')
break;
}
*token++ = 0;
return(n_tokens);
}
#ifdef TEST
struct
{
char *sep;
int nulls;
} data[] =
{
{ "/.", 0 },
{ "/.", 1 },
{ "/", 0 },
{ "/", 1 },
{ ".", 0 },
{ ".", 1 },
{ "", 0 }
};
static char string[] = "/fred//bill.c/joe.b/";
int main(void)
{
int i;
int j;
int n;
char input[100];
char *token[20];
for (i = 0; i < DIM(data); i++)
{
strcpy(input, string);
printf("\n\nTokenising <<%s>> using <<%s>>, null %d\n",
input, data[i].sep, data[i].nulls);
n = tokenise(input, data[i].sep, token, DIM(token),
data[i].nulls);
printf("Return value = %d\n", n);
for (j = 0; j < n; j++)
printf("Token %d: <<%s>>\n", j, token[j]);
if (n > 0)
printf("Token %d: 0x%08lX\n", n, (unsigned long)token[n]);
}
return(0);
}
#endif /* TEST */
I don't see anything wrong with the strtok approach of modifying a string in place - it's the caller's choice whether to operate on a duplicated string, as the semantics are well understood. Below is the same method slightly simplified to use strtok as intended, yet still return a handy array of char * pointers (which now simply point to the tokenized segments of the original string). It gives the same output for your original main() call.
The main advantage of this approach is that you only have to free the returned character array, instead of looping through to clear all of the elements - an aspect which I thought took away a lot of the simplicity factor and something a caller would be very unlikely to expect to do by any normal C convention.
I also took out the goto statements, because with the code refactored they just didn't make much sense to me. I think the danger of having a single cleanup point is that it can start to grow too unwieldy and do extra steps that are not needed to clean up issues at specific locations.
Personally I think the main philosophical point I would make is that you should respect what other people using the language are going to expect, especially when creating library kinds of calls. Even if the strtok replacement behavior seems odd to you, the vast majority of C programmers are used to placing \0 in the middle of C strings to split them up or create shorter strings and so this will seem quite natural. Also as noted no-one is going to expect to do anything beyond a single free() with the return value from a function. You need to write your code in whatever way needed to make sure then that the code works that way, as people will simply not read any documentation you might offer and will instead act according to the memory convention of your return value (which is char ** so a caller would expect to have to free that).
char** tokenize(char* input, const char* sep)
{
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 4;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
/* This is the array filled with tokens and returned
*/
char** toks = malloc(size * sizeof(*toks));
if ( toks == NULL )
return NULL;
toks[ntok] = strtok( input, sep );
/* While we have more tokens to process...
*/
do
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (newtoks == NULL)
{
free(toks);
return NULL;
}
toks = newtoks;
}
ntok++;
toks[ntok] = strtok(0, sep);
} while (toks[ntok]);
return toks;
}
Just a few things:
Using gotos is not intrinsically evil or bad; much like the preprocessor, they are simply often abused. In cases like yours, where you have to exit a function differently depending on how things went, they are appropriate.
Provide a functional means of freeing the returned array. I.e. tok_free(pointer).
Use the re-entrant version of strtok() initially, i.e. strtok_r(). It would not be cumbersome for someone to pass an additional argument (even NULL if not needed) for that.
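For reference, a minimal sketch of strtok_r() in use, assuming a POSIX system. The helper name split_r and its fixed-size output array are made up for this example. The extra save pointer replaces strtok()'s hidden static state.

```c
#define _POSIX_C_SOURCE 200809L   /* for strtok_r() */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line in place and stores up to max token
 * pointers in out. Returns the number stored. Uses no static state, so it
 * is safe to nest or call from several threads on different buffers. */
size_t split_r(char *line, const char *sep, char **out, size_t max)
{
    char *save = NULL;
    size_t n = 0;
    for (char *tok = strtok_r(line, sep, &save);
         tok != NULL && n < max;
         tok = strtok_r(NULL, sep, &save))
        out[n++] = tok;
    return n;
}
```

Because each call carries its own save pointer, one split can be applied to a token produced by another without the two interfering.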
There is a great tool for detecting memory leaks called Valgrind.
http://valgrind.org/
If you want to find memory leaks, one possibility is to run it with valgrind.

Resources