Segfault after strsep only when compiling with clang 10 - c

I am writing a parser (for NMEA sentences) which splits a string on commas using strsep. When compiled with clang (Apple LLVM version 10.0.1), the code segfaults when splitting a string which has an even number of tokens. When compiled with clang (version 7.0.1) or gcc (9.1.1) on Linux the code works correctly.
A stripped down version of the code which exhibits the issue is as follows:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>
static void gnss_parse_gsa (uint8_t argc, char **argv)
{
}
/**
* Descriptor for an NMEA sentence parser
*/
struct gps_parser_t {
void (*parse)(uint8_t, char**);
const char *type;
};
/**
* List of available NMEA sentence parsers
*/
static const struct gps_parser_t nmea_parsers[] = {
{.parse = gnss_parse_gsa, .type = "GPGSA"}
};
static void gnss_line_callback (char *line)
{
/* Count the number of comma-separated tokens in the line */
uint8_t num_args = 1;
for (uint16_t i = 0; i < strlen(line); i++) {
num_args += (line[i] == ',');
}
/* Tokenize the sentence */
char *args[num_args];
for (uint16_t i = 0; (args[i] = strsep(&line, ",")) != NULL; i++);
/* Run parser for received sentence */
uint8_t num_parsers = sizeof(nmea_parsers)/sizeof(nmea_parsers[0]);
for (int i = 0; i < num_parsers; i++) {
if (!strcasecmp(args[0] + 1, nmea_parsers[i].type)) {
nmea_parsers[i].parse(num_args, args);
break;
}
}
}
int main (int argc, char **argv)
{
char pgsa_str[] = "$GPGSA,A,3,02,12,17,03,19,23,06,,,,,,1.41,1.13,0.85*03";
gnss_line_callback(pgsa_str);
}
The segfault occurs on the line if (!strcasecmp(args[0] + 1, nmea_parsers[i].type)) {, where the index operation on args attempts to dereference a null pointer.
Increasing the size of the stack, either by manually editing the assembly or adding a call to printf("") anywhere in the function makes it no longer segfault, as does making the args array bigger (eg. adding one to num_args).
In summary, any of the following items prevent the segfault:
- Using a compiler other than clang 10
- Modifying the assembly to make the stack size before dynamic allocation 80 bytes or more (compiles to 64)
- Using an input string with an odd number of tokens
- Allocating args as a fixed length array with the correct number of tokens (or more)
- Allocating args as a variable length array with at least num_args + 1 elements
Note that when compiled with clang 7 on Linux the stack size before dynamic allocation is still 64 bytes, but the code does not segfault.
I'm hoping that someone might be able to explain why this happens, and if there is any way I can get this code to compile correctly with clang 10.

When all sorts of barely-relevant factors like the specific version of the compiler seem to make a difference, it's a pretty sure sign you've got undefined behavior somewhere.
You correctly count the commas to predetermine the exact number of fields, num_args. You allocate an array just barely big enough to hold those fields:
char *args[num_args];
But then you run this loop:
for (uint16_t i = 0; (args[i] = strsep(&line, ",")) != NULL; i++);
There are going to be num_args trips through this loop where strsep returns non-NULL pointers that get filled in to args[0] through args[num_args-1], which is what you intended, and which is fine. But then there's one more call to strsep, the one that returns NULL and terminates the loop -- and that null pointer also gets stored into the args array, specifically into args[num_args], which is one cell off the end. Array overflow, in other words.
There are two ways to fix this. You can use an additional variable so you can capture and test strsep's return value before storing it into the args array:
char *p;
for (uint16_t i = 0; (p = strsep(&line, ",")) != NULL; i++)
args[i] = p;
This also has the side benefit that you have a more conventional loop, with an actual body.
Or, you can declare the args array one bigger than it strictly has to be, meaning that it's got room for that last, NULL pointer stored in args[num_args]:
char *args[num_args+1];
This has the side benefit that you always pass a "NULL terminated array" to the parsing functions, which can be handy for them (and which ends up matching, as it happens, the way main gets called).
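Another way to rule out this kind of overflow is to tell the tokenizing code how many slots the array has. The following is a minimal sketch under stated assumptions, not the asker's code: split_fields is a hypothetical helper, and it uses the standard strcspn instead of strsep (so it builds without BSD/GNU extensions) while keeping strsep's behaviour of returning empty fields.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split `line` in place on `delim`, storing at most
 * `max_args` token pointers in `args`. Empty fields are kept, as with strsep.
 * Returns the number of tokens stored; nothing is ever written at or past
 * args[max_args]. */
static size_t split_fields(char *line, char delim, char **args, size_t max_args)
{
    size_t n = 0;
    const char set[2] = { delim, '\0' };

    while (line != NULL && n < max_args) {
        size_t len = strcspn(line, set);   /* length of the current field */
        args[n++] = line;
        if (line[len] == '\0') {
            line = NULL;                   /* that was the last field */
        } else {
            line[len] = '\0';              /* terminate this field */
            line += len + 1;               /* step past the delimiter */
        }
    }
    return n;
}
```

With char *args[num_args]; the call split_fields(line, ',', args, num_args) fills at most num_args entries and never stores a trailing null pointer.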

Related

Copying strings from extern char environ in C

I have a question pertaining to the extern char **environ. I'm trying to make a C program that counts the size of the environ list, copies it to an array of strings (array of array of chars), and then sorts it alphabetically with a bubble sort. It will print in name=value or value=name order depending on the format value.
I tried using strncpy to get the strings from environ to my new array, but the string values come out empty. I suspect I'm trying to use environ in a way I can't, so I'm looking for help. I've tried to look online for help, but this particular program is very limited. I cannot use system(), yet the only help I've found online tells me to make a program to make this system call. (This does not help).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
extern char **environ;
int main(int argc, char *argv[])
{
char **env = environ;
int i = 0;
int j = 0;
printf("Hello world!\n");
int listSZ = 0;
char temp[1024];
while(env[listSZ])
{
listSZ++;
}
printf("DEBUG: LIST SIZE = %d\n", listSZ);
char **list = malloc(listSZ * sizeof(char**));
char **sorted = malloc(listSZ * sizeof(char**));
for(i = 0; i < listSZ; i++)
{
list[i] = malloc(sizeof(env[i]) * sizeof(char)); // set the 2D Array strings to size 80, for good measure
sorted[i] = malloc(sizeof(env[i]) * sizeof(char));
}
while(env[i])
{
strncpy(list[i], env[i], sizeof(env[i]));
i++;
} // copy is empty???
for(i = 0; i < listSZ - 1; i++)
{
for(j = 0; j < sizeof(list[i]); j++)
{
if(list[i][j] > list[i+1][j])
{
strcpy(temp, list[i]);
strcpy(list[i], list[i+1]);
strcpy(list[i+1], temp);
j = sizeof(list[i]); // end loop, we resolved this specific entry
}
// else continue
}
}
This is my code, help is greatly appreciated. Why is this such a hard to find topic? Is it the lack of necessity?
EDIT: Pasted wrong code, this was a separate .c file on the same topic, but I started fresh on another file.
In a unix environment, the environment is a third parameter to main.
Try this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char *argv[], char **envp)
{
while (*envp) {
printf("%s\n", *envp);
envp++;
}
}
There are multiple problems with your code, including:
Allocating the 'wrong' size for list and sorted (you multiply by sizeof(char **), but should be multiplying by sizeof(char *) because you're allocating an array of char *). This bug won't actually hurt you this time. Using sizeof(*list) avoids the problem.
Allocating the wrong size for the elements in list and sorted. You need to use strlen(env[i]) + 1 for the size, remembering to allow for the null that terminates the string.
You don't check the memory allocations.
Your string copying loop is using strncpy() and shouldn't (actually, you should seldom use strncpy()), not least because it is only copying 4 or 8 bytes of each environment variable (depending on whether you're on a 32-bit or 64-bit system), and it is not ensuring that they're null-terminated strings (just one of the many reasons for not using strncpy()). Also, i is still equal to listSZ when that loop starts, so env[i] is already NULL and the loop body never runs, which is why the copy comes out empty.
Your outer loop of your 'sorting' code is OK; your inner loop is 100% bogus because you should be using the length of one or the other string, not the size of the pointer, and your comparisons are on single characters, but you're then using strcpy() where you simply need to move pointers around.
You allocate but don't use sorted.
You don't print the sorted environment to demonstrate that it is sorted.
Your code is missing the final }.
Here is some simple code that uses the standard C library qsort() function to do the sorting, and simulates POSIX strdup() under the name dup_str(); you could use strdup() if you have POSIX available to you.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern char **environ;
/* Can also be spelled strdup() and provided by the system */
static char *dup_str(const char *str)
{
size_t len = strlen(str) + 1;
char *dup = malloc(len);
if (dup != NULL)
memmove(dup, str, len);
return dup;
}
static int cmp_str(const void *v1, const void *v2)
{
const char *s1 = *(const char **)v1;
const char *s2 = *(const char **)v2;
return strcmp(s1, s2);
}
int main(void)
{
char **env = environ;
int listSZ;
for (listSZ = 0; env[listSZ] != NULL; listSZ++)
;
printf("DEBUG: Number of environment variables = %d\n", listSZ);
char **list = malloc(listSZ * sizeof(*list));
if (list == NULL)
{
fprintf(stderr, "Memory allocation failed!\n");
exit(EXIT_FAILURE);
}
for (int i = 0; i < listSZ; i++)
{
if ((list[i] = dup_str(env[i])) == NULL)
{
fprintf(stderr, "Memory allocation failed!\n");
exit(EXIT_FAILURE);
}
}
qsort(list, listSZ, sizeof(list[0]), cmp_str);
for (int i = 0; i < listSZ; i++)
printf("%2d: %s\n", i, list[i]);
return 0;
}
Other people pointed out that you can get at the environment via a third argument to main(), using the prototype int main(int argc, char **argv, char **envp). Note that Microsoft explicitly supports this. They're correct, but you can also get at the environment via environ, even in functions other than main(). The variable environ is unique amongst the global variables defined by POSIX in not being declared in any header file, so you must write the declaration yourself.
Note that the memory allocation is error checked and the error reported on standard error, not standard output.
Clearly, if you like writing and debugging sort algorithms, you can avoid using qsort(). Note that string comparisons need to be done using strcmp(), but you can't use strcmp() directly with qsort() when you're sorting an array of pointers because the argument types are wrong.
Part of the output for me was:
DEBUG: Number of environment variables = 51
0: Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.tQHOVHUgys/Render
1: BASH_ENV=/Users/jleffler/.bashrc
2: CDPATH=:/Users/jleffler:/Users/jleffler/src:/Users/jleffler/src/perl:/Users/jleffler/src/sqltools:/Users/jleffler/lib:/Users/jleffler/doc:/Users/jleffler/work:/Users/jleffler/soq/src
3: CLICOLOR=1
4: DBDATE=Y4MD-
…
47: VISUAL=vim
48: XPC_FLAGS=0x0
49: XPC_SERVICE_NAME=0
50: _=./pe17
If you want to sort the values instead of the names, you have to do some harder work. You'd need to define what output you wish to see. There are multiple ways of handling that sort.
To get the environment variables, you need to declare main like this:
int main(int argc, char **argv, char **env);
The third parameter is the NULL-terminated list of environment variables. See:
#include <stdio.h>
int main(int argc, char **argv, char **env)
{
for(size_t i = 0; env[i]; ++i)
puts(env[i]);
return 0;
}
The output of this is:
LD_LIBRARY_PATH=/home/shaoran/opt/node-v6.9.4-linux-x64/lib:
LS_COLORS=rs=0:di=01;34:ln=01;36:m
...
Note also that sizeof(environ[i]) in your code does not get you the length of
the string, it gets you the size of a pointer, so
strncpy(list[i], environ[i], sizeof(environ[i]));
is wrong. Also the whole point of strncpy is to limit based on the destination,
not on the source, otherwise if the source is larger than the destination, you
will still overflow the buffer. The correct call would be
strncpy(list[i], environ[i], 80);
list[i][79] = 0;
Bear in mind that strncpy might not write the '\0'-terminating byte if the
destination is not large enough, so you have to make sure to terminate the
string. Also note that 79 characters might be too short for storing env variables. For example, my LS_COLORS variable
is huge, at least 1500 characters long. You might want to do your list[i] = malloc calls based on strlen(environ[i])+1.
Another thing: your swapping
strcpy(temp, list[i]);
strcpy(list[i], list[i+1]);
strcpy(list[i+1], temp);
j = sizeof(list[i]);
works only if all list[i] point to memory of the same size. Since the list[i] are pointers, the cheaper way of swapping would be by
swapping the pointers instead:
char *tmp = list[i];
list[i] = list[i+1];
list[i+1] = tmp;
This is more efficient, is an O(1) operation and you don't have to worry if the
memory spaces are not of the same size.
What I don't get is, what do you intend with j = sizeof(list[i])? Not only
does sizeof(list[i]) return the size of a pointer (which will be constant
for all list[i]), but why are you messing with the running variable j inside the
block? If you want to leave the loop, then use break. And you are looking for
strlen(list[i]): this will give you the length of the string.
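Putting those points together (compare whole strings with strcmp, swap the pointers rather than the characters, and leave loops with break), a bubble sort over the copied list could look roughly like this. It is a minimal sketch; sort_strings is a name made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Bubble-sort an array of n string pointers into ascending strcmp() order.
 * Only the pointers move; the strings themselves are never copied. */
static void sort_strings(char **list, size_t n)
{
    for (size_t pass = 0; pass + 1 < n; pass++) {
        int swapped = 0;
        for (size_t i = 0; i + 1 < n - pass; i++) {
            if (strcmp(list[i], list[i + 1]) > 0) {
                char *tmp = list[i];       /* O(1) pointer swap */
                list[i] = list[i + 1];
                list[i + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped)
            break;                         /* already sorted, stop early */
    }
}
```

Called as sort_strings(list, listSZ), this works whatever the lengths of the individual strings are, because no string is ever copied into another string's storage.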

Unable to pin down a bug in a simple multithreading program

I am working on a project that involves multi-threading. Although I have a decent understanding of multi-threading, I have not written much multi-threaded code. The following is just a simple program that I wrote for practice. It works fine when compiled with gcc -pthread.
To build upon this code, I need to include some libraries that already have pthread included and linked. If I compile by including and linking those libraries, 3 out of 5 times it gives segmentation fault. There is some problem with the first for-loop in main() -- replacing this for-loop with multiple statements does the work.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <pthread.h>
#define NUM_THREADS 3
pthread_mutex_t m_lock = PTHREAD_MUTEX_INITIALIZER;
typedef struct{
int id;
char ip[20];
} thread_data;
void *doOperation(void* ctx)
{
pthread_mutex_lock(&m_lock);
thread_data *m_ctx = (thread_data *)ctx;
printf("Reached here\n");
pthread_mutex_unlock(&m_lock);
pthread_exit(NULL);
}
int main()
{
thread_data ctx[NUM_THREADS];
pthread_t threads[NUM_THREADS];
for (int i = 0; i < NUM_THREADS; ++i)
{
char ip_n[] = "127.0.0.";
char ip_h[4];
sprintf(ip_h, "%d", i+1);
strcpy(ctx[i].ip, strcat(ip_n, ip_h));
}
for (int i = 0; i < NUM_THREADS; ++i)
{
pthread_create(&threads[i], NULL, doOperation, (void *)&ctx[i])
}
for (int i = 0; i < NUM_THREADS; ++i)
{
pthread_join(threads[i], NULL);
}
pthread_exit(NULL);
}
You say the code you present "works fine", but it is buggy. In particular, the first for loop is buggy, so it is not surprising that it gives you trouble under some circumstances. Here's a breakdown:
char ip_n[] = "127.0.0.";
You have declared ip_n as an array of char exactly long enough to hold the given initializer, including its terminating null character.
char ip_h[4];
sprintf(ip_h, "%d", i+1);
Supposing that the sprintf() succeeds, you have written a non-empty string into char array ip_h.
strcpy(ctx[i].ip, strcat(ip_n, ip_h));
You attempt via strcat() to append the contents of ip_h to the end of ip_n, but there is no room -- this writes past the bounds of ip_n, producing undefined behavior.
The easiest way to solve this would probably be to declare ip_n with an explicit length that is sufficient for the full data. In general, a dotted-quad IP address string might require as many as 16 bytes, including the terminator:
char ip_n[16] = "127.0.0.";
You can't strcat(ip_n, ip_h) because the array ip_n is only big enough to hold the string "127.0.0.". Here's what the man page says, emphasis added
The strcat() and strncat() functions append a copy of the
null-terminated string s2 to the end of the null-terminated string
s1, then add a terminating `\0'. The string s1 must have sufficient
space to hold the result.
The declaration should be
char ip_n[20] = "127.0.0.";
I just added ; at the end of pthread_create(&threads[i], NULL, doOperation, (void *)&ctx[i])
And this segmentation fault might be because of
char ip_n[] = "127.0.0.";
Above, sizeof(ip_n) is only 9, but you need at least 10 characters to store a string like 127.0.0.3 (including the null character at the end). Writing past the end of the array might result in a segmentation fault. Try replacing this with char ip_n[10] = "127.0.0.";
ip_n[] is an array initialized from a string literal; the compiler reserved 9 bytes for it, including the terminating null byte. Anything after these 9 bytes should not be accessed, and if you do, the result is undefined (it could work some of the time but likely not all the time). When you do:
strcat(ip_n, ip_h)
You are overflowing the ip_n buffer. Maybe this is what is causing your problem. If it's not, I still recommend fixing this.
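The answers above all come down to giving the destination enough room. One way to avoid the intermediate ip_n and ip_h arrays entirely is to format straight into the destination with snprintf, which is told the buffer size and reports truncation. This is only a sketch; format_loopback_ip is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdio.h>

/* Write "127.0.0.<host>" into dst, which has room for `size` bytes.
 * Returns 0 on success, -1 if the result did not fit or formatting failed. */
static int format_loopback_ip(char *dst, size_t size, int host)
{
    int n = snprintf(dst, size, "127.0.0.%d", host);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

In the first loop of main() this would be called as format_loopback_ip(ctx[i].ip, sizeof ctx[i].ip, i + 1), and no strcat is needed.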

best practice for returning a variable length string in c

I have a string function that accepts a pointer to a source string and returns a pointer to a destination string. This function currently works, but I'm worried I'm not following the best practice regarding malloc, realloc, and free.
The thing that's different about my function is that the length of the destination string is not the same as the source string, so realloc() has to be called inside my function. I know from looking at the docs...
http://www.cplusplus.com/reference/cstdlib/realloc/
that the memory address might change after the realloc. This means I can't "pass by reference" like a C programmer might for other functions; I have to return the new pointer.
So the prototype for my function is:
//decode a uri encoded string
char *net_uri_to_text(char *);
I don't like the way I'm doing it because I have to free the pointer after running the function:
char * chr_output = net_uri_to_text("testing123%5a%5b%5cabc");
printf("%s\n", chr_output); //testing123Z[\abc
free(chr_output);
Which means that malloc() and realloc() are called inside my function and free() is called outside my function.
I have a background in high level languages, (perl, plpgsql, bash) so my instinct is proper encapsulation of such things, but that might not be the best practice in C.
The question: Is my way best practice, or is there a better way I should follow?
full example
Compiles and runs with two warnings on unused argc and argv arguments, you can safely ignore those two warnings.
example.c:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *net_uri_to_text(char *);
int main(int argc, char ** argv) {
char * chr_input = "testing123%5a%5b%5cabc";
char * chr_output = net_uri_to_text(chr_input);
printf("%s\n", chr_output);
free(chr_output);
return 0;
}
//decodes uri-encoded string
//send pointer to source string
//return pointer to destination string
//WARNING!! YOU MUST USE free(chr_result) AFTER YOU'RE DONE WITH IT OR YOU WILL GET A MEMORY LEAK!
char *net_uri_to_text(char * chr_input) {
//define variables
int int_length = strlen(chr_input);
int int_new_length = int_length;
char * chr_output = malloc(int_length);
char * chr_output_working = chr_output;
char * chr_input_working = chr_input;
int int_output_working = 0;
unsigned int uint_hex_working;
//while not a null byte
while(*chr_input_working != '\0') {
//if %
if (*chr_input_working == *"%") {
//then put correct char in
sscanf(chr_input_working + 1, "%02x", &uint_hex_working);
*chr_output_working = (char)uint_hex_working;
//printf("special char:%c, %c, %d<\n", *chr_output_working, (char)uint_hex_working, uint_hex_working);
//realloc
chr_input_working++;
chr_input_working++;
int_new_length -= 2;
chr_output = realloc(chr_output, int_new_length);
//output working must be the new pointer plus how many chars we've done
chr_output_working = chr_output + int_output_working;
} else {
//put char in
*chr_output_working = *chr_input_working;
}
//increment pointers and number of chars in output working
chr_input_working++;
chr_output_working++;
int_output_working++;
}
//last null byte
*chr_output_working = '\0';
return chr_output;
}
It's perfectly ok to return malloc'd buffers from functions in C, as long as you document the fact that they do. Lots of libraries do that, even though no function in the standard library does.
If you can compute (a not too pessimistic upper bound on) the number of characters that need to be written to the buffer cheaply, you can offer a function that does that and let the user call it.
It's also possible, but much less convenient, to accept a buffer to be filled in; I've seen quite a few libraries that do that like so:
/*
* Decodes uri-encoded string encoded into buf of length len (including NUL).
* Returns the number of characters needed (including NUL). If that number is
* greater than len, the output did not fit and you should try again with a
* larger buffer.
*/
size_t net_uri_to_text(char const *encoded, char *buf, size_t len)
{
size_t space_needed = 0;
while (decoding_needs_to_be_done()) {
// decode characters, but only write them to buf
// if it wouldn't overflow;
// increment space_needed regardless
}
return space_needed;
}
Now the caller is responsible for the allocation, and would do something like
size_t len = SOME_VALUE_THAT_IS_USUALLY_LONG_ENOUGH;
char *result = xmalloc(len);
len = net_uri_to_text(input, result, len);
if (len > SOME_VALUE_THAT_IS_USUALLY_LONG_ENOUGH) {
// try again with a buffer of the size the first call reported
result = xrealloc(result, len);
len = net_uri_to_text(input, result, len);
}
(Here, xmalloc and xrealloc are "safe" allocating functions that I made up to skip NULL checks.)
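To make the skeleton above concrete, here is one possible sketch of the caller-supplied-buffer variant. The name uri_decode_into and the snprintf-like convention (return the space needed, never write past len) are assumptions for illustration. Malformed escapes are not validated here; strtoul simply yields 0 for them.

```c
#include <stddef.h>
#include <stdlib.h>

/* Decode `encoded` into buf (capacity len bytes, including the NUL).
 * Returns the number of bytes the full result needs, including the NUL.
 * If the return value is greater than len, the output was truncated. */
static size_t uri_decode_into(const char *encoded, char *buf, size_t len)
{
    size_t needed = 0;

    while (*encoded != '\0') {
        char c = *encoded;
        if (c == '%' && encoded[1] != '\0' && encoded[2] != '\0') {
            char hex[3] = { encoded[1], encoded[2], '\0' };
            c = (char)strtoul(hex, NULL, 16);
            encoded += 3;
        } else {
            encoded++;
        }
        if (needed + 1 < len)      /* keep one byte free for the NUL */
            buf[needed] = c;
        needed++;                  /* count the byte whether or not it fit */
    }
    if (len > 0)
        buf[needed < len ? needed : len - 1] = '\0';
    return needed + 1;
}
```

Because nothing is written when len is 0, a caller can also use uri_decode_into(input, NULL, 0) as a sizing pass and then allocate exactly the returned number of bytes.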
The thing is that C is low-level enough to force the programmer to get her memory management right. In particular, there's nothing wrong with returning a malloc()ated string. It's a common idiom to return malloc()ated objects and have the caller free() them.
And anyways, if you don't like this approach, you can always take a pointer to the string and modify it from inside the function (after the last use, it will still need to be free()d, though).
One thing, however, that I don't think is necessary is explicitly shrinking the string. If the new string is shorter than the old one, there's obviously enough room for it in the memory chunk of the old string, so you don't need to realloc().
(Apart from the fact that you forgot to allocate one extra byte for the terminating NUL character, of course...)
And, as always, you can just return a different pointer each time the function is called, and you don't even need to call realloc() at all.
If you accept one last piece of good advice: it's advisable to const-qualify your input strings, so the caller can ensure that you don't modify them. Using this approach, you can safely call the function on string literals, for example.
All in all, I'd rewrite your function like this:
char *unescape(const char *s)
{
size_t l = strlen(s);
char *p = malloc(l + 1), *r = p;
while (*s) {
if (*s == '%') {
char buf[3] = { s[1], s[2], 0 };
*p++ = strtol(buf, NULL, 16); // yes, I prefer this over scanf()
s += 3;
} else {
*p++ = *s++;
}
}
*p = 0;
return r;
}
And call it as follows:
int main()
{
const char *in = "testing123%5a%5b%5cabc";
char *out = unescape(in);
printf("%s\n", out);
free(out);
return 0;
}
It's perfectly OK to return newly-malloc-ed (and possibly internally realloced) values from functions, you just need to document that you are doing so (as you do here).
Other obvious items:
Instead of int int_length you might want to use size_t. This is "an unsigned type" (usually unsigned int or unsigned long) that is the appropriate type for lengths of strings and arguments to malloc.
You need to allocate n+1 bytes initially, where n is the length of the string, as strlen does not include the terminating 0 byte.
You should check for malloc failing (returning NULL). If your function will pass the failure on, document that in the function-description comment.
sscanf is pretty heavy-weight for converting the two hex bytes. Not wrong, except that you're not checking whether the conversion succeeds (what if the input is malformed? you can of course decide that this is the caller's problem but in general you might want to handle that). You can use isxdigit from <ctype.h> to check for hexadecimal digits, and/or strtoul to do the conversion.
Rather than doing one realloc for every % conversion, you might want to do a final "shrink realloc" if desirable. Note that if you allocate (say) 50 bytes for a string and find it requires only 49 including the final 0 byte, it may not be worth doing a realloc after all.
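The points in this list can be combined into a single function. What follows is a rough sketch, not a drop-in replacement: uri_decode and hex_value are hypothetical names, the hex conversion assumes an ASCII-compatible character set, and returning NULL for malformed input is one possible policy among several.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit; the caller has already checked isxdigit(c). */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Returns a newly malloc'd decoded copy of s that the caller must free(),
 * or NULL if allocation fails or a '%' is not followed by two hex digits. */
static char *uri_decode(const char *s)
{
    size_t len = strlen(s);
    char *out = malloc(len + 1);   /* decoded text is never longer than s */
    char *p = out;

    if (out == NULL)
        return NULL;
    while (*s != '\0') {
        if (*s == '%') {
            if (!isxdigit((unsigned char)s[1]) ||
                !isxdigit((unsigned char)s[2])) {
                free(out);         /* malformed escape: give up cleanly */
                return NULL;
            }
            *p++ = (char)(hex_value((unsigned char)s[1]) * 16 +
                          hex_value((unsigned char)s[2]));
            s += 3;
        } else {
            *p++ = *s++;
        }
    }
    *p = '\0';
    return out;
}
```

No realloc is performed at all; if the few unused bytes matter, a single shrinking realloc just before the return would be the place for it.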
I would approach the problem in a slightly different way. Personally, I would split your function in two. The first function to calculate the size you need to malloc. The second would write the output string to the given pointer (which has been allocated outside of the function). That saves several calls to realloc, and will keep the complexity the same. A possible function to find the size of the new string is:
/* Length of the decoded string, not counting the terminating '\0' */
size_t getNewSize (const char *string) {
    const char *i;
    size_t size = 0, percent = 0;
    for (i = string; *i != '\0'; i++, size++) {
        if (*i == '%')
            percent++;
    }
    return size - percent * 2;
}
However, as mentioned in other answers there is no problem in returning a malloc'ed buffer as long as you document it!
In addition to what was already mentioned in the other postings, you should also document the fact that the string is reallocated. If your code is called with a static string or a string allocated with alloca, you may not reallocate it.
I think you are right to be concerned about splitting up mallocs and frees. As a rule, whatever makes it, owns it and should free it.
In this case, where the strings are relatively small, one good procedure is to make the string buffer larger than any possible string it could contain. For example, URLs have a de facto limit of about 2000 characters, so if you malloc 10000 characters you can store any possible URL.
Another trick is to store both the length and capacity of the string at its front, so that *(int *)mystring is the length of the string and *(int *)(mystring + 4) is its capacity (assuming a 4-byte int). Thus, the string itself only starts at the 8th position, mystring + 8. By doing this you can pass around a single pointer to a string and always know how long it is and how much memory capacity the string has. You can make macros that automatically generate these offsets and make "pretty code".
The value of using buffers this way is you do not need to do a reallocation. The new value overwrites the old value and you update the length at the beginning of the string.
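The same length-plus-capacity idea can be written without manual offsets by putting the two counters in a struct header with a C99 flexible array member. This is a swapped-in variant of the trick described above, offered as a sketch; struct lstr, lstr_new and lstr_set are names invented for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Header and character data live in one allocation. */
struct lstr {
    size_t len;    /* current length, not counting the NUL */
    size_t cap;    /* maximum length the buffer can hold */
    char data[];   /* cap + 1 bytes follow the header */
};

/* Allocate an empty string able to hold up to cap characters. */
static struct lstr *lstr_new(size_t cap)
{
    struct lstr *s = malloc(sizeof *s + cap + 1);
    if (s != NULL) {
        s->len = 0;
        s->cap = cap;
        s->data[0] = '\0';
    }
    return s;
}

/* Overwrite the contents in place; returns -1 (and changes nothing)
 * if text does not fit in the existing capacity. */
static int lstr_set(struct lstr *s, const char *text)
{
    size_t n = strlen(text);
    if (n > s->cap)
        return -1;
    memcpy(s->data, text, n + 1);
    s->len = n;
    return 0;
}
```

A single free(s) releases both the header and the characters, and s->data can be passed to any function expecting an ordinary C string.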

Simple C string manipulation

I'm trying to do some very basic string processing in C (e.g. given a filename, chop off the file extension, manipulate the filename and then add the extension back on). I'm rather rusty on C and am getting segmentation faults.
char* fname;
char* fname_base;
char* outdir;
char* new_fname;
.....
fname = argv[1];
outdir = argv[2];
fname_len = strlen(fname);
strncpy(fname_base, fname, (fname_len-4)); // weird characters at the end of the truncation?
strcpy(new_fname, outdir); // getting a segmentation on this I think
strcat(new_fname, "/");
strcat(new_fname, fname_base);
strcat(new_fname, "_test");
strcat(new_fname, ".jpg");
printf("string=%s",new_fname);
Any suggestions or pointers welcome.
Many thanks and apologies for such a basic question
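You need to allocate memory for new_fname and fname_base. Here is how you would do it for new_fname, sized for everything you later concatenate into it:
new_fname = malloc(strlen(outdir) + 1 + strlen(fname) + strlen("_test.jpg") + 1);
The first +1 is for the "/" separator and the last +1 is for the terminating null character '\0'. strlen(fname) is used as a safe upper bound for the length of fname_base.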
In addition to what others are indicating, I would be careful with
strncpy(fname_base, fname, (fname_len-4));
You are assuming you want to chop off the last 4 characters (.???). If there is no file extension or it is not 3 characters, this will not do what you want. The following should give you an idea of what might be needed (I assume that the last '.' indicates the file extension). Note that my 'C' is very rusty (warning!)
char *s;
s = (char *) strrchr (fname, '.');
if (s == 0)
{
strcpy (fname_base, fname);
}
else
{
strncpy (fname_base, fname, strlen(fname)-strlen(s));
fname_base[strlen(fname)-strlen(s)] = 0;
}
You have to malloc fname_base and new_fname, I believe.
ie:
fname_base = (char *)(malloc(sizeof(char)*(fname_len+1)));
fname_base[fname_len-4] = 0; //to stick in the null termination after the strncpy
and similarly for new_fname (outdir points at argv[2], so it needs no allocation)
You're using uninitialized pointers as targets for strcpy-like functions: fname_base and new_fname: you need to allocate memory areas to work on, or declare them as char array e.g.
char fname_base[FILENAME_MAX];
char new_fname[FILENAME_MAX];
you could combine the malloc that has been suggested, with the string manipulations in one statement
if ( asprintf(&new_fname,"%s/%s_test.jpg",outdir,fname_base) >= 0 )
// success, else failed
then at some point, free(new_fname) to release the memory.
(note this is a GNU extension which is also available in *BSD)
Cleaner code:
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
const char *extra = "_test.jpg";
int main(int argc, char** argv)
{
    if (argc < 3)
        return 1;
    char *fname = strdup(argv[1]); /* duplicate, we need to truncate the dot */
    char *outdir = argv[2];
    char *dotpos;
    if (fname == NULL)
        return 1;
    dotpos = strrchr(fname, '.');
    if(dotpos)
        *dotpos = '\0'; /* truncate at the last dot */
    /* outdir + '/' + base name + extra + terminating '\0' */
    size_t new_size = strlen(outdir) + 1 + strlen(fname) + strlen(extra) + 1;
    char *new_fname = malloc(new_size);
    if (new_fname == NULL) {
        free(fname);
        return 1;
    }
    snprintf(new_fname, new_size, "%s/%s%s", outdir, fname, extra);
    printf("%s\n", new_fname);
    free(new_fname);
    free(fname);
    return 0;
}
In the following code I do not call malloc.
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
/* Change this to '\\' if you are doing this on MS-windows or something like it. */
#define DIR_SYM '/'
#define EXT_SYM '.'
#define NEW_EXT "jpg"
int main(int argc, char * argv[] ) {
char * fname;
char * outdir;
if (argc < 3) {
fprintf(stderr, "I want more command line arguments\n");
return 1;
}
fname = argv[1];
outdir = argv[2];
char * fname_base_begin = strrchr(fname, DIR_SYM); /* last occurrence of DIR_SYM */
if (!fname_base_begin) {
fname_base_begin = fname; // No directory symbol means that there's nothing
// to chop off of the front.
} else {
fname_base_begin++; // Step past the directory symbol itself.
}
char * fname_base_end = strrchr(fname_base_begin, EXT_SYM);
/* NOTE: No need to search for EXT_SYM in part of the fname that we have cut off
* the front and then have to deal with finding the last EXT_SYM before the last
* DIR_SYM */
if (!fname_base_end) {
fprintf(stderr, "I don't know what you want to do when there is no extension\n");
return 1;
}
*fname_base_end = '\0'; /* Makes this an end of string instead of EXT_SYM */
/* NOTE: In this code I actually changed the string passed in with the previous
* line. This is often not what you want to do, but in this case it should be ok.
*/
// This line should get you the results I think you were trying for in your example
printf("string=%s%c%s_test%c%s\n", outdir, DIR_SYM, fname_base_begin, EXT_SYM, NEW_EXT);
// This line should just append _test before the extension, but leave the extension
// as it was before.
printf("string=%s%c%s_test%c%s\n", outdir, DIR_SYM, fname_base_begin, EXT_SYM, fname_base_end+1);
return 0;
}
I was able to get away with not allocating memory to build the string in because I let printf actually worry about building it, and took advantage of knowing that the original fname string would not be needed in the future.
I could have allocated the space for the string by calculating how long it would need to be based on the parts and then used sprintf to form the string for me.
Also, if you don't want to alter the contents of the fname string you could also have used:
printf("string=%s%c%.*s_test%c%s\n", outdir, DIR_SYM, (int)(fname_base_end - fname_base_begin), fname_base_begin, EXT_SYM, fname_base_end+1);
To make printf only use part of the string.
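The same precision trick works with snprintf when the result is needed in a buffer rather than on standard output. Below is a minimal sketch; build_test_name is a hypothetical helper, '/' is assumed as the directory separator, and the "_test.jpg" suffix is taken from the question.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build "<outdir>/<stem>_test.jpg" in dst, where <stem> is the file name
 * part of fname without its directory and without its last extension.
 * fname itself is not modified. Returns 0 on success, -1 if dst is too small. */
static int build_test_name(char *dst, size_t size,
                           const char *outdir, const char *fname)
{
    const char *base = strrchr(fname, '/');
    base = base ? base + 1 : fname;            /* skip any directory part */

    const char *dot = strrchr(base, '.');
    size_t stem = dot ? (size_t)(dot - base) : strlen(base);

    /* %.*s prints at most `stem` characters of base */
    int n = snprintf(dst, size, "%s/%.*s_test.jpg", outdir, (int)stem, base);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

A caller could declare char new_fname[FILENAME_MAX]; and check the return value before using the result.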
The basis of any C string manipulation is that you must write into (and, with few exceptions, read from) memory you "own". Declaring something as a pointer (type *x) reserves space for the pointer, not for the pointee, which of course can't be known by magic, so you have to malloc (or similar) or provide a local buffer with something like char buf[size].
And you should always be aware of buffer overflow.
As suggested, using sprintf (with a correctly allocated destination buffer) or similar could be a good idea. Anyway, if you want to keep your current strcat approach, remember that to concatenate strings, strcat always has to "walk" through the current string from its beginning. So if you don't need buffer overflow checks of any kind (oops!), appending chars "by hand" is a bit faster: when you finish appending a string, you know where the new end is, and the next append can start from there.
But strcat doesn't tell you the address of the last char appended, and using strlen would nullify the effort. So a possible solution could be
size_t l = strlen(new_fname);
new_fname[l++] = '/';
for(i = 0; fname_base[i] != 0; i++, l++) new_fname[l] = fname_base[i];
for(i = 0; testjpgstring[i] != 0; i++, l++) new_fname[l] = testjpgstring[i];
new_fname[l] = 0; // terminate the string...
and you can continue using l... (testjpgstring = "_test.jpg")
However, if your program is full of string manipulations, I suggest using a string library (out of laziness I often use glib).
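The "remember where the end is" idea can also be packaged as a small function that returns the new end of the string. The sketch below is an assumption about how one might write it: append and build_name are hypothetical names, and neither does any bounds checking, so the destination buffer must already be large enough. POSIX.1-2008 provides stpcpy, which returns the end pointer in the same way.

```c
#include <string.h>

/* Hypothetical helper: copies src to dst (including the terminator) and
 * returns a pointer to the terminator it wrote, so the next append can
 * start there. No bounds checking: dst must have room for the result. */
char *append(char *dst, const char *src)
{
    while ((*dst = *src) != '\0') {
        dst++;
        src++;
    }
    return dst;
}

/* Appends "/", base and suffix to the string already stored in buf. */
void build_name(char *buf, const char *base, const char *suffix)
{
    char *end = buf + strlen(buf);   /* the only scan of the existing text */

    end = append(end, "/");
    end = append(end, base);
    append(end, suffix);
}
```

With buf holding "out", build_name(buf, "photo", "_test.jpg") leaves "out/photo_test.jpg" in buf, and each piece is traversed only once.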

The intricacy of a string tokenization function in C

To brush up my C, I'm writing some useful library code. When it comes to reading text files, it's always useful to have a convenient tokenization function that does most of the heavy lifting (looping on strtok is inconvenient and dangerous).
When I wrote this function, I was amazed at its intricacy. To tell the truth, I'm almost convinced that it contains bugs (especially memory leaks in case of an allocation error). Here's the code:
/* Given an input string and separators, returns an array of
** tokens. Each token is a dynamically allocated, NUL-terminated
** string. The last element of the array is a sentinel NULL
** pointer. The returned array (and all the strings in it) must
** be deallocated by the caller.
**
** In case of errors, NULL is returned.
**
** This function is much slower than a naive in-line tokenization,
** since it copies the input string and does many allocations.
** However, it's much more convenient to use.
*/
char** tokenize(const char* input, const char* sep)
{
/* strtok ruins its input string, so we'll work on a copy
*/
char* dup;
/* This is the array filled with tokens and returned
*/
char** toks = 0;
/* Current token
*/
char* cur_tok;
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 2;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
size_t i;
if (!(dup = strdup(input)))
return NULL;
if (!(toks = malloc(size * sizeof(*toks))))
goto cleanup_exit;
cur_tok = strtok(dup, sep);
/* While we have more tokens to process...
*/
while (cur_tok)
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (!newtoks)
goto cleanup_exit;
toks = newtoks;
}
/* Now the array is definitely large enough, so we just
** copy the new token into it.
*/
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
ntok++;
cur_tok = strtok(0, sep);
}
free(dup);
toks[ntok] = 0;
return toks;
cleanup_exit:
free(dup);
for (i = 0; i < ntok; ++i)
free(toks[i]);
free(toks);
return NULL;
}
And here's simple usage:
int main()
{
char line[] = "The quick brown fox jumps over the lazy dog";
char** toks = tokenize(line, " \t");
int i;
for (i = 0; toks[i]; ++i)
printf("%s\n", toks[i]);
/* Deallocate
*/
for (i = 0; toks[i]; ++i)
free(toks[i]);
free(toks);
return 0;
}
Oh, and strdup:
/* strdup isn't ANSI C, so here's one...
*/
char* strdup(const char* str)
{
size_t len = strlen(str) + 1;
char* dup = malloc(len);
if (dup)
memcpy(dup, str, len);
return dup;
}
A few things to note about the code of the tokenize function:
strtok has the impolite habit of writing over its input string. To save the user's data, I only call it on a duplicate of the input. The duplicate is obtained using strdup.
strdup isn't ANSI-C, however, so I had to write one
The toks array is grown dynamically with realloc, since we have no idea in advance how many tokens there will be. The initial size is 2 just for testing, in real-life code I would probably set it to a much higher value. It's also returned to the user, and the user has to deallocate it after use.
In all cases, extreme care is taken not to leak resources. For example, if realloc returns NULL, it won't run over the old pointer. The old pointer will be released and the function returns. No resources leak when tokenize returns (except in the nominal case where the array returned to the user must be deallocated after use).
A goto is used for more convenient cleanup code, according to the philosophy that goto can be good in some cases (this is a good example, IMHO).
The following function can help with simple deallocation in a single call:
/* Given a pointer to the tokens array returned by 'tokenize',
** frees the array and sets it to point to NULL.
*/
void tokenize_free(char*** toks)
{
if (toks && *toks)
{
int i;
for (i = 0; (*toks)[i]; ++i)
free((*toks)[i]);
free(*toks);
*toks = 0;
}
}
I'd really like to discuss this code with other users of SO. What could've been done better? Would you recommend a different interface to such a tokenizer? How could the burden of deallocation be taken from the user? Are there memory leaks in the code anyway?
Thanks in advance
One thing I would recommend is to provide tokenize_free that handles all the deallocations. It's easier on the user and gives you the flexibility to change your allocation strategy in the future without breaking users of your library.
The code below fails when the first character of the string is a separator:
One additional idea is not to bother duplicating each individual token. I don't see what it adds, and it just gives you more places where the code can fail. Instead, just keep the duplicate of the full buffer you made. What I mean is change:
toks[ntok] = strdup(cur_tok);
if (!toks[ntok])
goto cleanup_exit;
to:
toks[ntok] = cur_tok;
Drop the line free(dup) from the non-error path. Finally, this changes cleanup to:
free(toks[0]);
free(toks);
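One caveat with cleaning up via free(toks[0]): strtok skips leading separators, so the first token is not guaranteed to start at the beginning of the duplicated buffer, and with no tokens at all toks[0] is NULL. A minimal sketch of the single-copy idea that sidesteps this, assuming a hypothetical pair tokenize_shared / tokenize_shared_free, keeps the buffer pointer in a hidden slot just before the array handed to the caller:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of tokenize(): tokens point into one private copy of
 * the input. Slot 0 of the allocation holds that copy so it can always be
 * freed, even when the input starts with a separator or has no tokens; the
 * caller sees the array starting at slot 1. */
char **tokenize_shared(const char *input, const char *sep)
{
    size_t len = strlen(input) + 1;
    size_t size = 4;              /* slots allocated, including slot 0 */
    size_t n = 1;                 /* next free slot */
    char *tok;
    char *copy = malloc(len);
    char **slots = malloc(size * sizeof(*slots));

    if (copy == NULL || slots == NULL) {
        free(copy);
        free(slots);
        return NULL;
    }
    memcpy(copy, input, len);
    slots[0] = copy;

    for (tok = strtok(copy, sep); tok != NULL; tok = strtok(NULL, sep)) {
        if (n + 1 >= size) {      /* keep room for the token and the sentinel */
            char **grown = realloc(slots, 2 * size * sizeof(*slots));
            if (grown == NULL) {
                free(copy);
                free(slots);
                return NULL;
            }
            slots = grown;
            size *= 2;
        }
        slots[n++] = tok;
    }
    slots[n] = NULL;              /* sentinel */
    return slots + 1;
}

/* Releases everything tokenize_shared() allocated. */
void tokenize_shared_free(char **toks)
{
    if (toks != NULL) {
        free(toks[-1]);           /* the private copy of the input */
        free(toks - 1);           /* the pointer array itself */
    }
}
```

The caller still iterates until the NULL sentinel, but there are only two allocations to fail and exactly two frees, both hidden behind one function.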
You don't need to strdup() each token; you duplicate the input string, and could let strtok() chop that up. It simplifies releasing the resources afterwards, too - you only have to release the array of pointers and the single string.
I agree with those who say that you need a function to release the data - unless you change the interface radically and have the user provide the array of pointers as an input parameter, and then you would probably also decide that the user is responsible for duplicating the string if it must be preserved. That leads to an interface:
int tokenize(char *source, const char *sep, char **tokens, size_t max_tokens);
The return value would be the number of tokens found.
You have to decide what to do when there are more tokens than slots in the array. Options include:
returning an error indication (negative number, likely -1), or
the full number of tokens found but the pointers that can't be assigned aren't, or
just the number of tokens that fitted, or
one more than the number of tokens, indicating that there were more, but no information on exactly how many more.
I chose to return '-1', and it led to this code:
/*
#(#)File: $RCSfile: tokenise.c,v $
#(#)Version: $Revision: 1.9 $
#(#)Last changed: $Date: 2008/02/11 08:44:50 $
#(#)Purpose: Tokenise a string
#(#)Author: J Leffler
#(#)Copyright: (C) JLSS 1987,1989,1991,1997-98,2005,2008
#(#)Product: :PRODUCT:
*/
/*TABSTOP=4*/
/*
** 1. A token is 0 or more characters followed by a terminator or separator.
** The terminator is ASCII NUL '\0'. The separators are user-defined.
** 2. A leading separator is preceded by a zero-length token.
** A trailing separator is followed by a zero-length token.
** 3. The number of tokens found is returned.
** The list of token pointers is terminated by a NULL pointer.
** 4. The routine returns 0 if the arguments are invalid.
** It returns -1 if too many tokens were found.
*/
#include "jlss.h"
#include <string.h>
#define NO 0
#define YES 1
#define IS_SEPARATOR(c,s,n) (((c) == *(s)) || ((n) > 1 && strchr((s),(c))))
#define DIM(x) (sizeof(x)/sizeof(*(x)))
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_tokenise_c[] = "#(#)$Id: tokenise.c,v 1.9 2008/02/11 08:44:50 jleffler Exp $";
#endif /* lint */
int tokenise(
char *str, /* InOut: String to be tokenised */
char *sep, /* In: Token separators */
char **token, /* Out: Pointers to tokens */
int maxtok, /* In: Maximum number of tokens */
int nulls) /* In: Are multiple separators OK? */
{
int c;
int n_tokens;
int tokenfound;
int n_sep = strlen(sep);
if (n_sep <= 0 || maxtok <= 2)
return(0);
n_tokens = 1;
*token++ = str;
while ((c = *str++) != '\0')
{
tokenfound = NO;
while (c != '\0' && IS_SEPARATOR(c, sep, n_sep))
{
tokenfound = YES;
*(str - 1) = '\0';
if (nulls)
break;
c = *str++;
}
if (tokenfound)
{
if (++n_tokens >= maxtok - 1)
return(-1);
if (nulls)
*token++ = str;
else
*token++ = str - 1;
}
if (c == '\0')
break;
}
*token++ = 0;
return(n_tokens);
}
#ifdef TEST
struct
{
char *sep;
int nulls;
} data[] =
{
{ "/.", 0 },
{ "/.", 1 },
{ "/", 0 },
{ "/", 1 },
{ ".", 0 },
{ ".", 1 },
{ "", 0 }
};
static char string[] = "/fred//bill.c/joe.b/";
int main(void)
{
int i;
int j;
int n;
char input[100];
char *token[20];
for (i = 0; i < DIM(data); i++)
{
strcpy(input, string);
printf("\n\nTokenising <<%s>> using <<%s>>, null %d\n",
input, data[i].sep, data[i].nulls);
n = tokenise(input, data[i].sep, token, DIM(token),
data[i].nulls);
printf("Return value = %d\n", n);
for (j = 0; j < n; j++)
printf("Token %d: <<%s>>\n", j, token[j]);
if (n > 0)
printf("Token %d: 0x%08lX\n", n, (unsigned long)token[n]);
}
return(0);
}
#endif /* TEST */
I don't see anything wrong with the strtok approach of modifying a string in place - it's the caller's choice whether to operate on a duplicated string or not, as the semantics are well understood. Below is the same method slightly simplified to use strtok as intended, yet still return a handy array of char * pointers (which now simply point to the tokenized segments of the original string). It gives the same output for your original main() call.
The main advantage of this approach is that you only have to free the returned character array, instead of looping through to clear all of the elements - an aspect which I thought took away a lot of the simplicity factor and something a caller would be very unlikely to expect to do by any normal C convention.
I also took out the goto statements, because with the code refactored they just didn't make much sense to me. I think the danger of having a single cleanup point is that it can start to grow too unwieldy and do extra steps that are not needed to clean up issues at specific locations.
Personally I think the main philosophical point I would make is that you should respect what other people using the language are going to expect, especially when creating library kinds of calls. Even if the strtok replacement behavior seems odd to you, the vast majority of C programmers are used to placing \0 in the middle of C strings to split them up or create shorter strings, so this will seem quite natural. Also, as noted, no one is going to expect to do anything beyond a single free() with the return value from a function. You need to write your code so that it actually works that way, because people will simply not read any documentation you might offer and will instead act according to the memory convention of your return value (which is char **, so a caller would expect to have to free that).
char** tokenize(char* input, const char* sep)
{
/* Size of the 'toks' array. Starts low and is doubled when
** exhausted.
*/
size_t size = 4;
/* 'ntok' points to the next free element of the 'toks' array
*/
size_t ntok = 0;
/* This is the array filled with tokens and returned
*/
char** toks = malloc(size * sizeof(*toks));
if ( toks == NULL )
return NULL; /* the function returns char**, so a bare return is invalid */
toks[ntok] = strtok( input, sep );
/* While we have more tokens to process...
*/
do
{
/* We should still have 2 empty elements in the array,
** one for this token and one for the sentinel.
*/
if (ntok > size - 2)
{
char** newtoks;
size *= 2;
newtoks = realloc(toks, size * sizeof(*toks));
if (newtoks == NULL)
{
free(toks);
return NULL;
}
toks = newtoks;
}
ntok++;
toks[ntok] = strtok(0, sep);
} while (toks[ntok]);
return toks;
}
Just a few things:
Using gotos is not intrinsically evil or bad; much like the preprocessor, they are simply often abused. In cases like yours, where you have to exit a function differently depending on how things went, they are appropriate.
Provide a functional means of freeing the returned array. I.e. tok_free(pointer).
Use the re-entrant version of strtok() initially, i.e. strtok_r(). It would not be cumbersome for someone to pass an additional argument (even NULL if not needed) for that.
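To illustrate the strtok_r suggestion: strtok keeps its position in hidden static state, so nested or concurrent tokenizations interfere with each other, whereas strtok_r stores the position in a caller-supplied pointer. The following is a minimal sketch, assuming a POSIX system; split_r is a hypothetical helper, not part of the code under review.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes strtok_r in <string.h> */
#include <string.h>

/* Hypothetical helper: splits line in place and stores up to max token
 * pointers in out. Returns the number of tokens stored. */
size_t split_r(char *line, const char *sep, char **out, size_t max)
{
    char *state;                  /* strtok_r keeps its position here */
    size_t n = 0;
    char *tok;

    for (tok = strtok_r(line, sep, &state);
         tok != NULL && n < max;
         tok = strtok_r(NULL, sep, &state)) {
        out[n++] = tok;
    }
    return n;
}
```

Because all state lives in the local variable state, two calls to split_r on different strings can be interleaved or run in different threads without disturbing each other.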
There is a great tool for detecting memory leaks called Valgrind.
http://valgrind.org/
If you want to find memory leaks, one possibility is to run it with valgrind.

Resources