Here is some C code that converts a wchar_t* string into a char* string:
wchar_t *myXML = L"<test/>";
size_t length;
char *charString;
size_t i;
length = wcslen(myXML);
charString = (char *)malloc(length);
wcstombs_s(&i, charString, length, myXML, length);
The code compiles, but at execution it hits a fatal error on the last line and stops running.
Now, if I replace the last line with this one:
wcstombs_s(&i, charString, length+1, myXML, length);
I just added +1 to the third argument. Then it works perfectly...
Why is this trick needed? Or is there a flaw elsewhere in my code?
You need one extra byte for the '\0' terminator character. wcslen does not include this in the length it returns!
To do this properly, you don't just need to pass length+1 to wcstombs_s but also to malloc:
charString = (char *)malloc(length+1);
wcstombs_s(&i, charString, length+1, myXML, length);
And even then, I suspect it will not work correctly. Not all wide characters can be mapped to a single char, so for non-ASCII characters you will need extra space in the multi-byte string.
DESCRIPTION
The wcslen() function is the wide-character
equivalent of the strlen(3) function. It determines
the length of the wide-character string pointed to by
s, not including the terminating L'\0' character.
The trick is that you should always look for code of the form:
string = malloc(len);
very suspiciously, because both wcslen(3) and strlen(3) return the string length without the nul byte, and malloc(3) must allocate the space with that byte. C kinda sucks sometimes.
So every time you see string = malloc(len); rather than string = malloc(len+1);, be very careful to read how len gets assigned.
charString = (char *)malloc(length + 1);
Ought to do the trick. :)
EDIT:
Better would be to ask wcstombs() for the size to allocate in the first place:
size_t len = wcstombs(NULL,src,0) + 1;
char *dest = malloc(len);
len = wcstombs(dest, src, len);
if (len == (size_t)-1) /* handle error */ ...
The +1 makes room for the terminating NUL, and wcstombs() itself reports how many bytes the conversion needs. The conversion runs twice, once to measure the memory required and once to store the result, but the code will be MUCH simpler to maintain. The second time, when it stores the result, it will write at most len bytes including the NUL. Note that if the first call fails it returns (size_t)-1, so adding 1 wraps len around to 0; check for that before relying on the buffer.
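The two-pass approach could be wrapped up roughly like this. It is only a sketch: wide_to_multibyte is a hypothetical helper name, and it assumes wcstombs() accepts a NULL destination to report the required size, which POSIX and the common C libraries support but which, as far as I know, ISO C does not spell out.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not from the thread): returns a malloc'd multibyte
   copy of src, or NULL on conversion or allocation failure.
   Assumes wcstombs(NULL, src, 0) reports the required byte count. */
char *wide_to_multibyte(const wchar_t *src)
{
    /* First pass: bytes needed, not counting the terminator. */
    size_t needed = wcstombs(NULL, src, 0);
    if (needed == (size_t)-1)
        return NULL;                 /* src contains an unconvertible character */

    char *dest = malloc(needed + 1); /* +1 for the terminating '\0' */
    if (dest == NULL)
        return NULL;

    /* Second pass: convert for real, terminator included. */
    if (wcstombs(dest, src, needed + 1) == (size_t)-1) {
        free(dest);
        return NULL;
    }
    return dest;
}
```

The caller owns the returned buffer and is expected to free() it.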
Related
I've spotted the following piece of C code, marked as BAD (aka buffer overflow bad).
The problem is I don't quite get why? The input string length is captured before the allocation etc.
char *my_strdup(const char *s)
{
size_t len = strlen(s) + 1;
char *c = malloc(len);
if (c) {
strcpy(c, s); // BAD
}
return c;
}
Update from comments:
the 'BAD' marker is imprecise: the code is not bad. It is inefficient, yes, and risky (see below), yes.
Why risky? The +1 after the strlen() call is required so that the space allocated on the heap also has room for the string terminator ('\0').
There is no bug in your sample function.
However, to make it obvious to future readers (both human and mechanical) that there is no bug, you should replace the strcpy call with a memcpy:
char *my_strdup(const char *s)
{
size_t len = strlen(s) + 1;
char *c = malloc(len);
if (c) {
memcpy(c, s, len);
}
return c;
}
Either way, len bytes are allocated and len bytes are copied, but with memcpy that fact stands out much more clearly to the reader.
There's no problem with this code.
While it's possible that strcpy can cause undefined behavior if the destination buffer isn't large enough to hold the string in question, the buffer is allocated to be the correct size. This means there is no risk of overrunning the buffer.
You may see some guides recommend using strncpy instead, which allows you to specify the maximum number of characters to copy, but this has its own problems. If the source string is too long, only the specified number of characters will be copied, however this also means that the string isn't null terminated which requires the user to do so manually. For example:
char src[] = "test data";
char dest[5];
strncpy(dest, src, sizeof dest); // dest holds "test " with no null terminator
dest[sizeof(dest) - 1] = 0; // manually null terminate, dest holds "test"
I tend towards the use of strcpy if I know the source string will fit, otherwise I'll use strncpy and manually null-terminate.
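The strncpy-plus-manual-termination pattern can be folded into one function so the terminator is never forgotten. This is a sketch only; copy_truncated and its parameter names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dest (capacity dest_size bytes) with
   strncpy and always null-terminates the result.
   Returns 1 if src had to be truncated, 0 if it fit completely. */
int copy_truncated(char *dest, size_t dest_size, const char *src)
{
    if (dest_size == 0)
        return 1;                       /* no room even for the terminator */

    strncpy(dest, src, dest_size - 1);  /* copy at most dest_size - 1 chars */
    dest[dest_size - 1] = '\0';         /* strncpy may not have terminated */

    return strlen(src) >= dest_size;
}
```

With a 5-byte destination and the source "test data", dest ends up holding "test" and the function reports truncation.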
I cannot see any problem with the code when it comes to the use of strcpy.
But you should be aware that it requires s to be a valid C string. That is a reasonable requirement, but it should be specified.
If you want, you could put in a simple check for NULL, but I would say that it's ok to do without it. If you're about to make a copy of a "string" pointed to by a null pointer, then you probably should check either the argument or the result. But if you want, just add this as the first line:
if(!s) return NULL;
But as I said, it does not add much. It just makes it possible to change
if(!str) {
// Handle error
} else {
new_str = my_strdup(str);
}
to:
new_str = my_strdup(str);
if(!new_str) {
// Handle error
}
Not really a huge gain
I have the following code in C now
int length = 50;
char *target_str = (char*) malloc(length);
char *source_str = read_string_from_somewhere(); // read a string from somewhere
// with length, say 20
memcpy(target_str, source_str, length);
The scenario is that target_str is initialized with 50 bytes. source_str is a string of length 20.
If I want to copy the source_str to target_str I use memcpy() as above with length 50, which is the size of target_str. The reason I use length in memcpy is that the source_str can have a max value of length but is usually less than that (in the above example it's 20).
Now, if I want to copy till length of source_str based on its terminating character ('\0'), even if memcpy length is more than the index of terminating character, is the above code a right way to do it? or is there an alternative suggestion.
Thanks for any help.
The scenario is that target_str is initialized with 50 bytes. source_str is a string of length 20.
If I want to copy the source_str to target_str I use memcpy() as above with length 50, which is the size of target_str.
Currently you ask memcpy to read about 30 bytes past the end of the source string, because memcpy ignores any null terminator in the source. That is undefined behavior.
Because you are copying a string, you can use strcpy rather than memcpy.
But the size problem can be reversed: the target can be smaller than the source, and without protection you will again have undefined behavior.
So you can use strncpy with the length of the target. Just remember to add a final null character in case the target is smaller than the source:
int length = 50;
char *target_str = (char*) malloc(length);
char *source_str = read_string_from_somewhere(); // length unknown
strncpy(target_str, source_str, length - 1); // -1 to leave room for \0
target_str[length - 1] = 0; // force a null character at the end in case source_str was truncated
If I want to copy the source_str to target_str I use memcpy() as above
with length 50, which is the size of target_str. The reason I use
length in memcpy is that the source_str can have a max value of
length but is usually less than that (in the above example it's 20).
It is crucially important to distinguish between
the size of the array to which source_str points, and
the length of the string, if any, to which source_str points (+/- the terminator).
If source_str is certain to point to an array of length 50 or more then the memcpy() approach you present is ok. If not, then it produces undefined behavior when source_str in fact points to a shorter array. Any result within the power of your C implementation may occur.
If source_str is certain to point to a (properly-terminated) C string of no more than length - 1 characters, and if it is its string value that you want to copy, then strcpy() is more natural than memcpy(). It will copy all the string contents, up to and including the terminator. This presents no problem when source_str points to an array shorter than length, so long as it contains a string terminator.
If neither of those cases is certain to hold, then it's not clear what you want to do. The strncpy() function may cover some of those cases, but it does not cover all of them.
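When neither guarantee holds, one option is to pass the size of the source array explicitly and stop at whichever comes first, the terminator or the end of the array. A minimal sketch follows; copy_from_array and its parameters are hypothetical names, not anything from the question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: src points to an array of src_size bytes that may or
   may not contain a '\0'. Copies the string part (or the whole array if no
   terminator is found) into dest, truncating to fit, and always terminates.
   Returns the number of characters stored, terminator excluded. */
size_t copy_from_array(char *dest, size_t dest_size,
                       const char *src, size_t src_size)
{
    if (dest_size == 0)
        return 0;

    /* Look for the terminator without reading past the source array. */
    const char *end = memchr(src, '\0', src_size);
    size_t len = end ? (size_t)(end - src) : src_size;

    if (len > dest_size - 1)
        len = dest_size - 1;            /* leave room for the terminator */

    memcpy(dest, src, len);
    dest[len] = '\0';
    return len;
}
```

Unlike memcpy with a fixed length, this never reads more than src_size bytes from the source, and unlike strcpy it does not depend on the source being terminated.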
Now, if I want to copy till length of source_str based on its terminating character ('\0'), even if memcpy length is more than the index of terminating character, is the above code a right way to do it?
No; you'd be copying the entire content of source_str, even past the null-terminator if it occurs before the end of the allocated space for the string it is pointing to.
If your concern is minimizing the auxiliary space used by your program, what you could do is use strlen to determine the length of source_str, and allocate target_str based on that. Also, strcpy is similar to memcpy but is specifically intended for null-terminated strings (observe that it has no "size" or "length" parameter):
char *target_str = NULL;
char *source_str = read_string_from_somewhere();
size_t len = strlen(source_str);
target_str = malloc(len + 1);
strcpy(target_str, source_str);
// ...
free(target_str);
target_str = NULL;
memcpy is used to copy fixed blocks of memory, so if you want to copy something shorter that is terminated by '\0' you don't want to use memcpy.
There are other functions, like strncpy or strlcpy, that do similar things.
It is best to check what the implementations do. I removed the optimized versions from the original source code for the sake of readability.
This is an example memcpy implementation: https://git.musl-libc.org/cgit/musl/tree/src/string/memcpy.c
void *memcpy(void *restrict dest, const void *restrict src, size_t n)
{
unsigned char *d = dest;
const unsigned char *s = src;
for (; n; n--) *d++ = *s++;
return dest;
}
It's clear that here both pieces of memory are visited n times, regardless of the size of the source or destination string, so memory past the end of your string is copied if the string is shorter. That is bad and can cause various kinds of unwanted behavior.
This is strlcpy from: https://git.musl-libc.org/cgit/musl/tree/src/string/strlcpy.c
size_t strlcpy(char *d, const char *s, size_t n)
{
char *d0 = d;
if (!n--) goto finish;
for (; n && (*d=*s); n--, s++, d++);
*d = 0;
finish:
return d-d0 + strlen(s);
}
The trick here is that n && (*d = *s) evaluates to false as soon as the character just copied is the null terminator (or n reaches zero), which ends the loop early.
Hence this gives you the wanted behaviour.
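strlcpy is not available in every C library, so here is a small stand-in with the same contract, written for illustration rather than copied from musl. The name bounded_copy is made up. The useful part is the return value: it is the full length of the source, so a result greater than or equal to the buffer size signals truncation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in following the strlcpy contract: copies at most
   size - 1 characters of src into dest, always terminates when size > 0,
   and returns strlen(src) so the caller can detect truncation. */
size_t bounded_copy(char *dest, const char *src, size_t size)
{
    size_t src_len = strlen(src);

    if (size != 0) {
        size_t n = src_len < size - 1 ? src_len : size - 1;
        memcpy(dest, src, n);
        dest[n] = '\0';
    }
    return src_len;
}
```

A caller would typically write something like if (bounded_copy(buf, s, sizeof buf) >= sizeof buf) and treat that branch as "the string did not fit".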
Use strlen to determine the exact size of source_str and allocate accordingly, remembering to add an extra byte for the null terminator. Here's a full example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
char *source_str = "string_read_from_somewhere";
size_t len = strlen(source_str);
char *target_str = malloc(len + 1);
if (!target_str) {
fprintf(stderr, "%s:%d: malloc failed\n", __FILE__, __LINE__);
return 1;
}
memcpy(target_str, source_str, len + 1);
puts(target_str);
free(target_str);
return 0;
}
Also, there's no need to cast the result of malloc. Don't forget to free the allocated memory.
As mentioned in the comments, you probably want to restrict the size of the malloced string to a sensible amount.
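One way to cap the allocation, sketched under the assumption that a fixed upper bound is acceptable for your data. dup_limited and max_len are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicates src only if it is at most max_len
   characters long (terminator excluded). Returns NULL if the string is
   too long or if malloc fails. */
char *dup_limited(const char *src, size_t max_len)
{
    size_t len = strlen(src);
    if (len > max_len)
        return NULL;                    /* refuse oversized input */

    char *copy = malloc(len + 1);       /* +1 for the terminator */
    if (copy != NULL)
        memcpy(copy, src, len + 1);     /* copies the '\0' as well */
    return copy;
}
```

Note that this still requires src to be a properly terminated string, since strlen scans until it finds the '\0'.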
I have a function that takes as its input a string containing a hyperlink and is attempting to output that same hyperlink except that if it contains a question mark, that character and any characters that follow it are purged.
First, I open a text file and read in a line containing a link and only a link like so:
FILE * ifp = fopen(raw_links,"r");
char link_to_filter[200];
if(ifp == NULL)
{
printf("Could not open %s for reading\n", raw_links);
exit(0);
}
while(fscanf(ifp,"%s", link_to_filter) == 1)
{
add_link(a,link_to_filter, link_under_wget);
}
fclose(ifp);
Part of what add_link does is strip the unnecessary parts of the link after a question mark (like with xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/online-giving/step1.php?new=1) which are causing an issue with my calls to wget. It does this by feeding link_to_filter through this function remove_extra, seen below.
char * remove_extra(char * url)
{
char * test = url;
int total_size;
test = strchr(url,'?');
if (test != NULL)
{
total_size = test-url;
url[total_size] = '\0';
}
return url;
}
At the end of remove_extra, upon returning from remove_extra, and immediately prior to using strcpy, a call to printf like so
printf("%s",url);
will print out what I expect to see (e.g. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/online-giving/step1.php without the '?' or trailing characters), but immediately after this block of code runs
struct node * temp = (struct node *)malloc(sizeof(struct node));
char * new_link = remove_extra(url);
temp->hyperlink =(char *)malloc(strlen(new_link) * sizeof(char));
strncpy(temp->hyperlink, new_link, strlen(new_link));
the result of a printf on member hyperlink occasionally has a single, junk character at the end (sometimes 'A' or 'Q' or '!', but always the same character corresponding to the same string). If this were happening with every link or with specific types of links, I could figure something out,
but it's only for maybe every 20th link and it happens to links both short and long.
e.g.
xxxxxxxxxxxxxxxxxxxx/hr/ --> xxxxxxxxxxxxxxxxxxxx/hr/!
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/windows-to-the-past/ --> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/windows-to-the-past/Q
This is happening both with strcpy and homemade string copy loops so I'm inclined to believe that it's not strcpy's fault, but I can't think of where else the error would lie.
If you compute len as the length of src, then the naive use of strncpy -- tempting though it might be -- is incorrect:
size_t len = strlen(src);
dest = malloc(len); /* DON'T DO THIS */
strncpy(dest, src, len); /* DON'T DO THIS, EITHER */
strncpy copies exactly len bytes; it does not guarantee to put a NUL byte at the end. Since len is precisely the length of src, there is no NUL byte within the first len bytes of src, and no NUL byte will be inserted by strncpy.
If you had used strcpy instead (the supposedly "unsafe" interface):
strcpy(dest, src);
it would have been fine, except for the fact that dest is not big enough. What you really need to do is this:
dest = malloc(strlen(src) + 1); /* Note: include space for the NUL */
strcpy(dest, src);
or, if you have the useful strdup function:
dest = strdup(src);
Most likely you forgot to copy the null terminator at the end of the string. Remember that strlen(str) gives the number of characters before the terminator; you also need room for the '\0' at the end.
You need to do
temp->hyperlink = (char *)malloc(strlen(new_link) + 1); // char is one byte, sizeof(char) == 1
and
strcpy(temp->hyperlink, new_link);
which should work fine.
Why not use strdup?
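Putting the pieces of this thread together, the stripping and the copying can be done in one step without modifying the input. This is a sketch under my own naming (dup_without_query is hypothetical), not the asker's add_link code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of url with the first '?'
   and everything after it removed. Returns NULL if malloc fails. */
char *dup_without_query(const char *url)
{
    /* Number of characters before the first '?', or the whole length
       if there is no '?'. */
    size_t keep = strcspn(url, "?");

    char *copy = malloc(keep + 1);      /* +1 for the terminator */
    if (copy == NULL)
        return NULL;

    memcpy(copy, url, keep);
    copy[keep] = '\0';                  /* terminate explicitly */
    return copy;
}
```

Because the terminator is written explicitly into a buffer of keep + 1 bytes, there is no stray character after the end of the link.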
I know strncpy(s1, s2, n) copies n elements of s2 into s1, but that only fills it from the beginning of s1.
For example
s1[10] = "Teacher"
s2[20] = "Assistant"
strncpy(s1, s2, 2) would yield s1[10] = "As", correct?
Now what if I want s1 to contain "TeacherAs"? How would I do that? Is strncpy the appropriate thing to use in this case?
You can use strcat() to concatenate strings, but in this case you don't want the whole source string copied, so you need something like:
size_t len = strlen(s1);
strncpy(s1 + len, s2, 2);
s1[len + 2] = '\0';
(Add terminating nul; thanks #FatalError).
Which is pretty horrible, and you need to make sure the destination array has room for len + 3 bytes (the existing string, the two new characters, and the terminator).
There is also strncat(), part of the standard C library, which is much simpler to use:
strncat(s1, s2, 2);
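strncat always writes a terminator, but it does not know how big the destination is, so the capacity check is still up to the caller. A rough sketch of that check follows; append_chars is a hypothetical wrapper, not a library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: appends at most count characters of src to the
   string in dest, where dest has room for dest_size bytes in total.
   Returns 0 on success, -1 (leaving dest untouched) if it would not fit. */
int append_chars(char *dest, size_t dest_size, const char *src, size_t count)
{
    size_t dest_len = strlen(dest);
    size_t src_len = strlen(src);
    size_t n = count < src_len ? count : src_len;  /* chars actually appended */

    if (dest_len + n + 1 > dest_size)   /* +1 for the terminator */
        return -1;

    strncat(dest, src, n);              /* appends n chars plus '\0' */
    return 0;
}
```

With char s1[10] = "Teacher", appending 2 characters of "Assistant" succeeds and fills the array exactly; any further append is rejected.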
Use strcat.
Make sure your string you're appending to is big enough to hold both strings. In your case it isn't.
From the link above:
char * strcat ( char * destination, const char * source );
Concatenate strings
Appends a copy of the source string to the destination string. The terminating null character in destination is overwritten by the first character of source, and a null-character is included at the end of the new string formed by the concatenation of both in destination.
destination and source shall not overlap.
One way to achieve what you need is strlcat (but beware: it is not part of standard C, and some consider it insecure because it truncates silently)
strlcat(s1, s2, sizeof(s1));
This will concatenate part of the s2 string to s1 until the size of s1 is reached (this avoids overflowing the buffer).
You'll then have in s1 the string TeacherAs plus a NUL char to terminate it.
You need to make sure that enough memory is allocated for the resulting string.
s1[10]
holds 'TeacherAs' plus its terminator with no room to spare, so a larger buffer is safer.
From there, you'll want to do something like
//make sure s1 is big enough to hold s1+s2
char s1[40]="Teacher";
char s2[20]="Assistant";
//how many chars from second string you want to append
int offset = 2;
//allocate a temp buffer
char subbuff[20];
//copy n chars to buffer
memcpy( subbuff, s2, offset );
//null terminate buff right after the copied chars
subbuff[offset]='\0';
//do the actual cat
strcat(s1,subbuff);
I'd suggest using snprintf(), like:
size_t len = strlen(s1);
snprintf(s1 + len, sizeof(s1) - len, "%.2s", s2);
snprintf() will always nul terminate and won't overrun your buffer. Plus, it's standard as of C99. As a note, this assumes that s1 is an array declared in the current scope so that sizeof works, otherwise you'll need to provide the size.
I am trying to create a simple datastructure that will make it easy to convert back and forth between ASCII strings and Unicode strings. My issue is that the length returned by the function mbstowcs is correct but the length returned by the function wcslen, on the newly created wchar_t string, is not. Am I missing something here?
typedef struct{
wchar_t *string;
long length; // I have also tried int, and size_t
} String;
void setCString(String *obj, char *str){
obj->length = strlen(str);
free(obj->string); // Free original string
obj->string = (wchar_t *)malloc((obj->length + 1) * sizeof(wchar_t)); //Allocate space for new string to be copied to
//memset(obj->string,'\0',(obj->length + 1)); NOTE: I tried this but it doesn't make any difference
size_t length = 0;
length = mbstowcs(obj->string, (const char *)str, obj->length);
printf("Length = %d\n",(int)length); // Prints correct length
printf("!C string %s converted to wchar string %ls\n",str,obj->string); //obj->string is of a wcslen size larger than Length above...
if(length != wcslen(obj->string))
printf("Length failure!\n");
if(length == -1)
{
//Conversion failed, set string to NULL terminated character
free(obj->string);
obj->string = (wchar_t *)malloc(sizeof(wchar_t));
obj->string = L'\0';
}
else
{
//Conversion worked! but wcslen (and printf("%ls)) show the string is actually larger than length
//do stuff
}
}
The code seems to work fine for me. Can you provide more context, such as the content of strings you're passing to it, and what locale you're using?
A few other bugs/style issues I noticed:
obj->length is left as the allocated length, rather than updated to match the length in (wide) characters. Is that your intention?
The cast to const char * is useless and bad style.
Edit: Upon discussion, it looks like you may be using a nonconformant Windows version of the mbstowcs function. If so, your question should be updated to reflect as such.
Edit 2: The code only happened to work for me because malloc returned a fresh, zero-filled buffer. Since you are passing obj->length to mbstowcs as the maximum number of wchar_t values to write to the destination, it will run out of space and not be able to write the null terminator unless there's a proper multibyte character (one which requires more than a single byte) in the source string. Change this to obj->length+1 and it should work fine.
The length you need to pass to mbstowcs() includes the L'\0' terminator character, but your calculated length in obj->length does not include it - you need to add 1 to the value passed to mbstowcs().
In addition, instead of using strlen(str) to determine the length of the converted string, you should be using mbstowcs(0, str, 0) + 1. You should also change the type of str to const char *, and elide the cast. realloc() can be used in place of a free() / malloc() pair. Overall, it should look like:
typedef struct {
wchar_t *string;
size_t length;
} String;
void setCString(String *obj, const char *str)
{
obj->length = mbstowcs(0, str, 0);
obj->string = realloc(obj->string, (obj->length + 1) * sizeof(wchar_t));
size_t length = mbstowcs(obj->string, str, obj->length + 1);
printf("Length = %zu\n", length);
printf("!C string %s converted to wchar string %ls\n", str, obj->string);
if (length != wcslen(obj->string))
printf("Length failure!\n");
if (length == (size_t)-1)
{
//Conversion failed, set string to NULL terminated character
obj->string = realloc(obj->string, sizeof(wchar_t));
obj->string[0] = L'\0';
}
else
{
//Conversion worked!
//do stuff
}
}
Mark Benningfield points out that mbstowcs(0, str, 0) is a POSIX / XSI extension to the C standard - to obtain the required length under only standard C, you must instead use mbsrtowcs() (declared in <wchar.h>):
const char *src_copy = str;
obj->length = mbsrtowcs(NULL, &src_copy, 0, NULL);
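For completeness, here is how the measure-then-convert idea might look using only the standard mbsrtowcs() function. It is a sketch, not the answerer's code: multibyte_to_wide and out_len are hypothetical names, and the conversion depends on the current locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts the multibyte string str to a malloc'd wide
   string. On success stores the length in wide characters (terminator
   excluded) in *out_len, if out_len is not NULL. Returns NULL on failure. */
wchar_t *multibyte_to_wide(const char *str, size_t *out_len)
{
    mbstate_t state;
    const char *cursor = str;

    /* First pass: a NULL destination only counts the wide characters. */
    memset(&state, 0, sizeof state);
    size_t needed = mbsrtowcs(NULL, &cursor, 0, &state);
    if (needed == (size_t)-1)
        return NULL;                    /* invalid multibyte sequence */

    wchar_t *wide = malloc((needed + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Second pass: convert, leaving room for the L'\0' terminator. */
    cursor = str;
    memset(&state, 0, sizeof state);
    if (mbsrtowcs(wide, &cursor, needed + 1, &state) == (size_t)-1) {
        free(wide);
        return NULL;
    }

    if (out_len != NULL)
        *out_len = needed;
    return wide;
}
```

Because the second call is given needed + 1 as its limit, the terminator is always written, so wcslen() on the result agrees with the reported length.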
I am running this on Ubuntu linux with UTF-8 as locale.
Here is the additional info as requested:
I am calling this function with a fully allocated structure and passing in a hard coded "string" (not a L"string"). so I call the function with what is essentially setCString(*obj, "Hello!").
Length = 6
!C string Hello! converted to wchar string Hello!xxxxxxxxxxxxxxxxxxxx
(where x = random data)
Length failure!
for reference
printf("wcslen = %d\n",(int)wcslen(obj->string)); prints out as
wcslen = 11