I have to concatenate wide C-style strings, and based on my research, it seems that something like _memccpy would be ideal (in order to avoid Shlemiel's problem). But I can't seem to find a wide-character version. Does something like that exist?
Does something like that exist?
The C standard library does not contain a wide-character version of Microsoft's _memccpy(). Neither does it contain _memccpy() itself, although POSIX specifies the memccpy() function on which MS's _memccpy() appears to be modeled.
POSIX also defines wcpcpy() (a wide version of stpcpy()), which copies a wide string and returns a pointer to the end of the result. That's not as fully featured as memccpy(), but it would suffice to avoid Shlemiel's problem, if only Microsoft's C library provided a version of it.
You can, however, use swprintf() to concatenate wide strings without suffering from Shlemiel's problem, with the added advantage that it is in the standard library, since C99. It does not provide the memccpy behavior of halting after copying a user-specified (wide) character, but it does return the number of wide characters written, which is equivalent to returning a pointer to the end of the result. Also, it can directly concatenate an arbitrary fixed number of strings in a single call. swprintf does have significant overhead, though.
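For illustration, here is a minimal sketch of swprintf()-based concatenation; the helper name concat2 and the buffer size are assumptions made for this example, not part of any standard API:

#include <stdio.h>
#include <wchar.h>

/* Sketch only: concatenate two wide strings into a caller-provided buffer
 * with a single swprintf() call. */
static int concat2(wchar_t *dst, size_t dst_len, const wchar_t *a, const wchar_t *b)
{
    /* swprintf() returns the number of wide characters written (excluding
     * the terminator), or a negative value if the result does not fit. */
    return swprintf(dst, dst_len, L"%ls%ls", a, b);
}

int main(void)
{
    wchar_t buf[64];
    int n = concat2(buf, sizeof buf / sizeof buf[0], L"Hello, ", L"world");
    if (n >= 0)
        wprintf(L"%ls (%d wide characters)\n", buf, n);
    return 0;
}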
But of course, if the overhead of swprintf puts you off then it's pretty easy to write your own. The result might not be as efficient as a well-tuned implementation from your library vendor, but we're talking about a scaling problem, so you mainly need to win on the asymptotic complexity front. Simple example:
/*
* Copies at most 'count' wide characters from 'src' to 'dest', stopping after
* copying a wide character with value 0 if that happens first. If no 0 is
* encountered in the first 'count' wide characters of 'src' then the result
* will be unterminated.
* Returns 'dest' + n, where n is the number of non-zero wide characters copied to 'dest'.
*/
wchar_t *wcpncpy(wchar_t *dest, const wchar_t *src, size_t count) {
    for (wchar_t *bound = dest + count; dest < bound; ) {
        if ((*dest++ = *src++) == 0) return dest - 1;
    }
    return dest;
}
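Used for concatenation, the returned pointer lets each copy start where the previous one ended, which is exactly what avoids Shlemiel's problem. A hypothetical usage fragment (the buffer name and contents are assumptions, and it relies on the function defined above):

wchar_t buf[128];
wchar_t *end = buf;
wchar_t *limit = buf + (sizeof buf / sizeof buf[0]);

/* Each call resumes at 'end', so nothing already copied is rescanned. */
end = wcpncpy(end, L"alpha ", (size_t)(limit - end));
end = wcpncpy(end, L"beta ",  (size_t)(limit - end));
end = wcpncpy(end, L"gamma",  (size_t)(limit - end));
/* 'end' now points at the terminating null (provided everything fit). */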
As we know, different encodings map different representations to the same characters. Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well? I'd find this surprising, since these are compile-time!
This matters for tasks as simple as, for example, determining whether a string read from input contains a specific character. When reading strings from input it seems sensible to set the locale to the user's locale (setlocale(LC_ALL, "");) so that the string is read and processed correctly. But when we're comparing this string with a character literal, won't problems arise due to mismatched encoding?
In other words: the following snippet seems to work for me. But doesn't it work only because of coincidence? Because, for example, the source code happened to be saved in the same encoding that is used on the machine at runtime?
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "");

    // Read line and convert it to wide string so that wcschr can be used.
    // So many lines! And that's even though I'm omitting the necessary
    // error checking for brevity. Ah, I'm also omitting free's.
    char *s = NULL; size_t n = 0;
    getline(&s, &n, stdin);
    mbstate_t st = {0}; const char *cs = s;
    size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
    wchar_t *ws = malloc((wn + 1) * sizeof(wchar_t));
    st = (mbstate_t){0};
    mbsrtowcs(ws, &cs, (wn + 1), &st);

    int contains_guitar = (wcschr(ws, L'🎸') != NULL);
    if (contains_guitar)
        printf("Let's rock!\n");
    else
        printf("Let's not.\n");
    return 0;
}
How to do this correctly?
Using setlocale we can specify the encoding of strings that are read from input, but does this apply to string literals as well?
No. String literals use the execution character set, which is defined by your compiler at compile time.
The execution character set does not have to be the same as the source character set, the character set used in the source code. The C compiler is responsible for translating between them, and should have options for choosing/defining both. The default depends on the compiler, but on Linux and most current POSIXy systems it is usually UTF-8.
The following snippet seems to work for me. But doesn't it work only because of coincidence?
The example works because the character set of your locale, the source character set, and the execution character set used when the binary was constructed, all happen to be UTF-8.
How to do this correctly?
Two options. One is to use wide characters and wide string literals. The other is to use UTF-8 everywhere.
For wide input and output, see e.g. this example in another answer here.
Do note that getwline() and getwdelim() are not in POSIX.1, but in ISO/IEC TR 24731-2 ("dynamic allocation functions"). This means they are optional, and as of this writing, not widely available at all. Thus, a custom implementation around fgetwc() is recommended instead. (One based on fgetws(), wcslen(), and/or wcscspn() will not be able to handle embedded nulls, L'\0', correctly.)
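A minimal sketch of such a reader is shown below; it reads one line (up to a newline or end of file) into a growing buffer. The function name read_wide_line and its growth strategy are assumptions made for this example, not an existing API:

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Sketch of a getwline()-style reader built on fgetwc().
 * Returns the number of wide characters read (excluding the terminator),
 * or (size_t)-1 on end of file with nothing read, or on allocation failure.
 * Embedded L'\0' characters are stored like any other character. */
static size_t read_wide_line(wchar_t **line, size_t *capacity, FILE *stream)
{
    size_t used = 0;
    wint_t wc;

    while ((wc = fgetwc(stream)) != WEOF) {
        if (used + 2 > *capacity) {                 /* room for wc + L'\0' */
            size_t new_cap = *capacity ? *capacity * 2 : 64;
            wchar_t *tmp = realloc(*line, new_cap * sizeof **line);
            if (!tmp)
                return (size_t)-1;
            *line = tmp;
            *capacity = new_cap;
        }
        (*line)[used++] = (wchar_t)wc;
        if (wc == L'\n')
            break;
    }
    if (used == 0 && wc == WEOF)
        return (size_t)-1;
    (*line)[used] = L'\0';
    return used;
}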
In a typical wide I/O program, you only need mbstowcs() to convert command-line arguments and environment variables to wide strings.
Using UTF-8 everywhere is also a perfectly valid practical approach, at least if it is well documented, so that users know the program inputs and outputs UTF-8 strings, and developers know to ensure their C compiler uses UTF-8 as the execution character set when compiling those binaries.
Your program can even use e.g.
/* needs <locale.h>, <langinfo.h>, <string.h>, and <stdio.h> */
if (!setlocale(LC_ALL, ""))
    fprintf(stderr, "Warning: Your C library does not support your current locale.\n");
if (strcmp("UTF-8", nl_langinfo(CODESET)))
    fprintf(stderr, "Warning: Your locale does not use the UTF-8 character set.\n");
to verify the current locale uses UTF-8.
I have used both approaches, depending on the circumstances. It is difficult to say which one is more portable in practice, because, as usual, both work just fine everywhere except Windows.
If you're willing to assume UTF-8,
strstr(s,"🎸")
Or:
strstr(s,u8"🎸")
The latter avoids some assumptions but requires a C11 compiler. If you want the best of both and can sacrifice readability:
strstr(s,"\360\237\216\270")
For clarity I'm only talking about null terminated strings.
I'm familiar with the standard way of doing string comparisons in C with the usage of strcmp. But I feel like it's slow and inefficient.
I'm not necessarily looking for the easiest method but the most efficient.
Can the current comparison method (strcmp) be optimized further while the underlying code remains cross platform?
If strcmp can't be optimized further, what is the fastest way which I could perform the string comparison without strcmp?
Current use case:
Determine if two arbitrary strings match
Strings will not exceed 4096 bytes, nor be less than 1 byte in size
Strings are allocated/deallocated and compared within the same code/library
Once the comparison is complete I do pass the string to another C library which needs it in standard null-terminated form
System memory limits are not a huge concern, but I will have tens of thousands of such strings queued up for comparison
Strings may contain high-ASCII or UTF-8 characters, but for my purposes I only need to know if they match; content is not a concern
Application runs on x86 but should also run on x64
Reference to current strcmp() implementation:
How does strcmp work?
What does strcmp actually do?
GLIBC strcmp() source code
Edit: Clarified the solution does not need to be a modification of strcmp.
Edit 2: Added specific examples for this use case.
I'm afraid your reference implementation for strcmp() is both inaccurate and irrelevant:
it is inaccurate because it compares characters using the char type instead of the unsigned char type as specified in the C11 Standard:
7.24.4 Comparison functions
The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.
It is irrelevant because the actual implementation used by modern compilers is much more sophisticated, expanded inline using hand-coded assembly language.
Any generic implementation is likely to be less optimal, especially if coded to remain portable across platforms.
Here are a few directions to explore if your program's bottleneck is comparing strings.
Analyze your algorithms, try and find ways to reduce the number of comparisons: for example, if you search for a string in an array, sorting that array and using a binary search will drastically reduce the number of comparisons.
If your strings are tokens used in many different places, allocate unique copies of these tokens and use those as scalar values. The strings will be equal if and only if the pointers are equal. I use this trick in compilers and interpreters all the time with a hash table (see the sketch after this list).
If your strings have the same known length, you can use memcmp() instead of strcmp(). memcmp() is simpler than strcmp() and can be implemented even more efficiently in places where the strings are known to be properly aligned.
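A minimal interning sketch is shown below. It uses a flat table and linear search purely for brevity; a real implementation would use a hash table as described above, and all names here are made up for illustration:

#include <stdlib.h>
#include <string.h>

static char  *pool[1024];   /* interned strings; fixed size for the sketch */
static size_t pool_count;

/* Return the canonical copy of 's'.  Equal strings get equal pointers,
 * so later equality tests are just pointer comparisons (a == b). */
const char *intern(const char *s)
{
    size_t i;
    for (i = 0; i < pool_count; i++)
        if (strcmp(pool[i], s) == 0)
            return pool[i];

    if (pool_count == sizeof pool / sizeof pool[0])
        return NULL;                     /* table full (sketch only) */

    char *copy = strdup(s);              /* strdup() is POSIX (and C23) */
    if (!copy)
        return NULL;
    pool[pool_count++] = copy;
    return copy;
}

After interning, a == b replaces strcmp(a, b) == 0 for the hot comparisons.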
EDIT: with the extra information provided, you could use a structure like this for your strings:
typedef struct string_t {
    size_t len;
    size_t hash;   // optional
    char str[];    // flexible array member; use [1] for pre-C99 compilers
} string_t;
You allocate this structure this way:
string_t *create_str(const char *s) {
    size_t len = strlen(s);
    string_t *str = malloc(sizeof(*str) + len + 1);
    str->len = len;
    str->hash = hash_str(s, len);   // hash_str() is whatever string hash you use
    memcpy(str->str, s, len + 1);
    return str;
}
If you can use these str things for all your strings, you can greatly improve the efficiency of the matching by first comparing the lengths or the hashes. You can still pass the str member to your library function, it is properly null terminated.
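For illustration, the match test itself could then be a short function along these lines; the name str_equal is made up, and it assumes <string.h> for memcmp():

/* Sketch: equality test for two string_t objects, cheapest checks first. */
int str_equal(const string_t *a, const string_t *b)
{
    if (a == b)               /* same object */
        return 1;
    if (a->len != b->len)     /* different lengths can never match */
        return 0;
    if (a->hash != b->hash)   /* different hashes can never match */
        return 0;
    return memcmp(a->str, b->str, a->len) == 0;
}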
Is there any practical difference between using wcscpy_s and using wcsncpy? The only difference seems to be the order of parameters and return value:
errno_t wcscpy_s(wchar_t *strDestination,
size_t numberOfElements,
const wchar_t *strSource);
wchar_t *wcsncpy(wchar_t *strDest,
const wchar_t *strSource,
size_t count );
And if there is no practical difference, why did Microsoft need to add wcscpy_s to Visual Studio, when wcsncpy was already available and a standard function?
Is it OK to replace wcscpy_s to wcsncpy when porting from Visual Studio to gcc?
These two functions do not have the same behavior.
From the MSDN documentation of wcscpy_s:
Upon successful execution, the destination string will always be null terminated.
From the specification of wcsncpy (C11 7.29.4.2.2/1-3):
#include <wchar.h>
wchar_t *wcsncpy(wchar_t * restrict s1,
const wchar_t * restrict s2,
size_t n);
The wcsncpy function copies not more than n wide characters (those that follow a null wide character are not copied) from the array pointed to by s2 to the array pointed to by s1.
If the array pointed to by s2 is a wide string that is shorter than n wide characters, null wide characters are appended to the copy in the array pointed to by s1, until n wide characters in all have been written.
and the footnote (#346):
Thus, if there is no null wide character in the first n wide characters of the array pointed to by s2, the result will not be null-terminated.
Note that strncpy and wcsncpy are not designed for use with null-terminated strings. They are designed for use with null-padded, fixed-width strings.
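So when porting, rather than swapping wcscpy_s() for wcsncpy() one-for-one, a closer replacement is a small wrapper that truncates and always null-terminates. A sketch is below; the name, the return convention, and the decision to truncate (instead of invoking a handler, as wcscpy_s() does by default) are all assumptions for this example:

#include <wchar.h>

/* Copy 'src' into 'dst' (which holds 'dst_count' wide characters),
 * truncating if necessary and always null-terminating.
 * Returns 0 on success, nonzero on bad arguments or truncation. */
int wcscpy_trunc(wchar_t *dst, size_t dst_count, const wchar_t *src)
{
    if (!dst || !src || dst_count == 0)
        return 1;

    size_t i;
    for (i = 0; i + 1 < dst_count && src[i] != L'\0'; i++)
        dst[i] = src[i];
    dst[i] = L'\0';

    return src[i] != L'\0';   /* nonzero means the source was truncated */
}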
Another difference (that just cost me a couple hours of staring at code, wondering what was going on) is that the wcscpy_s function will, by default, terminate the application if you're going to overrun the buffer.
I expected it to behave like one of the strncpy variants. That is not the case!
You can apparently change this behavior with the _set_invalid_parameter_handler function.
wcscpy_s is more secure: it can detect your error and trigger the invalid parameter handler routine. To handle such errors without causing a crash, Microsoft provides _set_invalid_parameter_handler.
The functions with the appended _s are the more secure versions. The functions without the trailing _s will usually be marked as "deprecated" by VS2012, for example, and you will get a warning. MSDN has plenty of additional information on this.
I am developing a cross platform C (C89 standard) application which has to deal with UTF8 text. All I need is basic string manipulation functions like substr, first, last etc.
Question 1
Is there a UTF8 library that has the above functions implemented? I have already looked into ICU and it is too big for my requirement. I just need to support UTF8.
I have found a UTF8 decoder here. The following function prototypes are from that code.
void utf8_decode_init(char p[], int length);
int utf8_decode_next();
The initialization function takes a character array, but utf8_decode_next() returns int. Why is that? How can I print the characters this function returns using standard functions like printf? The function is dealing with character data, so how can that be assigned to an integer?
If the above decoder is not good for production code, do you have a better recommendation?
Question 2
I also got confused by reading articles that say that for Unicode you need to use wchar_t. From my understanding this is not required, as normal C strings can hold UTF-8 values. I have verified this by looking at the source code of SQLite and git. SQLite has the following typedef.
typedef unsigned char u8;
Is my understanding correct? Also why is unsigned char required?
The utf8_decode_next() function returns the next Unicode code point. Since Unicode is a 21-bit character set, it cannot return anything smaller than an int, and it can be argued that technically it should be a long, since an int could be a 16-bit quantity. Effectively, the function returns you a UTF-32 character.
You would need to look at the C94 wide character extensions to C89 to print wide characters (wprintf(), <wctype.h>, <wchar.h>). However, wide characters alone are not guaranteed to be UTF-8 or even Unicode. You most probably cannot print the characters from utf8_decode_next() portably, but it depends on what your portability requirements are. The wider the range of systems you must port to, the less chance there is of it all working simply. To the extent you can write UTF-8 portably, you would send the UTF-8 string (not an array of the UTF-32 characters obtained from utf8_decode_next()) to one of the regular printing functions. One of the strengths of UTF-8 is that it can be manipulated by code that is largely ignorant of it.
You need to understand that a 4-byte wchar_t can hold any Unicode codepoint in a single unit, but that UTF-8 can require between one and four 8-bit bytes (1-4 units of storage) to hold a single Unicode codepoint. On some systems, I believe wchar_t can be a 16-bit (short) integer. In this case, you are forced into using UTF-16, which encodes Unicode codepoints outside the Basic Multilingual Plane (BMP, code points U+0000 .. U+FFFF) using two storage units and surrogates.
Using unsigned char makes life easier; plain char is often signed. Having negative numbers makes life more difficult than it need be (and, believe me, it is difficult enough without adding complexity).
You do not need any special library routines for character or substring search with UTF-8. strstr does everything you need. That's the whole point of UTF-8 and the design requirements it was invented to meet.
GLib has quite a few relevant functions, and can be used independently of GTK+.
There are over 100,000 characters in Unicode. There are 256 possible values of char in most C implementations.
Hence, UTF-8 uses more than one char to encode each character, and the decoder needs a return type which is larger than char.
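To make the point concrete, here is a minimal sketch of what such a decoder does; it performs no validation of truncated, overlong, or otherwise invalid sequences, and the name utf8_next is made up for this example:

/* Decode one UTF-8 sequence starting at *p into a code point and advance
 * *p past it.  Returns the code point, or -1 on a malformed lead byte.
 * Illustration only: no validation of continuation bytes or overlong forms. */
static long utf8_next(const unsigned char **p)
{
    unsigned char b = *(*p)++;
    long cp;
    int extra;

    if      (b < 0x80) { cp = b;        extra = 0; }   /* 1 byte (ASCII)     */
    else if (b < 0xC0) { return -1; }                  /* stray continuation */
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }   /* 2-byte sequence    */
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }   /* 3-byte sequence    */
    else if (b < 0xF8) { cp = b & 0x07; extra = 3; }   /* 4-byte sequence    */
    else               { return -1; }

    while (extra-- > 0)
        cp = (cp << 6) | (*(*p)++ & 0x3F);

    return cp;   /* up to 21 bits, hence a return type wider than char */
}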
wchar_t is a larger type than char (well, it doesn't have to be larger, but it usually is). It represents the characters of the implementation-defined wide character set. On some implementations (most importantly, Windows, which uses surrogate pairs for characters outside the "basic multilingual plane"), it still isn't big enough to represent any Unicode character, which presumably is why the decoder you reference uses int.
You can't print wide characters using printf, because it deals in char. wprintf deals in wchar_t, so if the wide character set is Unicode, and if wchar_t is int on your system (as it is on Linux), then wprintf and friends will print the decoder output without further processing. Otherwise it won't.
In any case, you cannot portably print arbitrary Unicode characters, because there's no guarantee that the terminal can display them, or even that the wide character set is in any way related to Unicode.
SQLite has probably used unsigned char so that:
they know the signedness - it's implementation-defined whether char is signed or not.
they can do right-shifts and assign out-of-range values, and get consistent and defined results across all C implementations. Implemenations have more freedom how signed char behaves than unsigned char.
Normal C strings are fine for storing UTF-8 data, but you can't easily index them by character position or take substrings by character count. This is because a character encoded with UTF-8 occupies anywhere from one to 4 bytes depending on the character, i.e. a "character" is not equivalent to a "byte" in UTF-8 like it is for ASCII.
In order to do character-indexed operations like these, you need to decode the text to some internal format that represents whole Unicode characters, and then operate on that. Since there are far more than 256 Unicode characters, a byte (or char) is not enough; that's why the library you found uses ints.
As for your second question, it's probably just because it does not make sense to talk about negative characters, so they may as well be specified as "unsigned".
I have implemented substr and length functions which support UTF-8 characters. This code is a modified version of what SQLite uses.
The following macro advances through the input text and skips multi-byte sequences: the if condition checks whether the current byte starts a multi-byte sequence, and the loop inside it increments input until it finds the next lead byte.
#define SKIP_MULTI_BYTE_SEQUENCE(input) {              \
    if( (*(input++)) >= 0xc0 ) {                       \
        while( (*input & 0xc0) == 0x80 ){ input++; }   \
    }                                                  \
}
substr and length are implemented using this macro.
typedef unsigned char utf8;
substr
void substr(const utf8 *string,
            int start,
            int len,
            utf8 **substring)
{
    int bytes, i;
    const utf8 *str2;
    utf8 *output;

    --start;
    while( *string && start ) {
        SKIP_MULTI_BYTE_SEQUENCE(string);
        --start;
    }

    for(str2 = string; *str2 && len; len--) {
        SKIP_MULTI_BYTE_SEQUENCE(str2);
    }

    bytes = (int) (str2 - string);
    output = *substring;
    for(i = 0; i < bytes; i++) {
        *output++ = *string++;
    }
    *output = '\0';
}
length
int length(const utf8 *string)
{
    int len = 0;

    while( *string ) {
        ++len;
        SKIP_MULTI_BYTE_SEQUENCE(string);
    }
    return len;
}
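Caller-side usage might look like the fragment below. The output buffer must be allocated by the caller and be large enough for the selected bytes plus the terminator; the text, buffer size, and the assumption of a UTF-8 source/execution character set are all just for illustration:

/* Take 2 characters starting at character position 1 (positions are 1-based). */
const utf8 *text = (const utf8 *)"héllo";
utf8 buf[16];                                  /* caller-allocated output buffer */
utf8 *out = buf;

substr(text, 1, 2, &out);                      /* buf now holds "hé" */
printf("%s has %d characters\n", (const char *)buf, length(text));   /* needs <stdio.h> */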
strncpy() supposedly protects from buffer overflows. But if it prevents an overflow without null terminating, in all likelihood a subsequent string operation is going to overflow. So to protect against this I find myself doing:
strncpy( dest, src, LEN );
dest[LEN - 1] = '\0';
man strncpy gives:
The strncpy() function is similar, except that not more than n bytes of src are copied. Thus, if there is no null byte among the first n bytes of src, the result will not be null-terminated.
Without null terminating something seemingly innocent like:
printf( "FOO: %s\n", dest );
...could crash.
Are there better, safer alternatives to strncpy()?
strncpy() is not intended to be used as a safer strcpy(); it is supposed to be used to insert one string in the middle of another.
All those "safe" string handling functions such as snprintf() and vsnprintf() are fixes that have been added in later standards to mitigate buffer overflow exploits etc.
Wikipedia mentions strncat() as an alternative to writing your own safe strncpy():
*dst = '\0';
strncat(dst, src, LEN);
EDIT
I missed that strncat() writes past LEN characters when it appends the null terminator, if the source string is longer than or equal to LEN characters.
Anyway, the point of using strncat() instead of any homegrown solution such as memcpy(..., strlen(...))/whatever is that the implementation of strncat() might be target/platform optimized in the library.
Of course you need to check that dst holds at least the nullchar, so the correct use of strncat() would be something like:
if (LEN) {
    *dst = '\0';
    strncat(dst, src, LEN - 1);
}
I also admit that strncpy() is not very useful for copying a substring into the middle of another string: if the src is shorter than n characters, the null padding will truncate the destination string at that point.
Originally, the 7th Edition UNIX file system (see DIR(5)) had directory entries that limited file names to 14 bytes; each entry in a directory consisted of 2 bytes for the inode number plus 14 bytes for the name, null padded to 14 characters, but not necessarily null-terminated. It's my belief that strncpy() was designed to work with those directory structures - or, at least, it works perfectly for that structure.
Consider:
A 14 character file name was not null terminated.
If the name was shorter than 14 bytes, it was null padded to full length (14 bytes).
This is exactly what would be achieved by:
strncpy(inode->d_name, filename, 14);
So, strncpy() was ideally fitted to its original niche application. It was only coincidentally about preventing overflows of null-terminated strings.
(Note that null padding up to length 14 is not a serious overhead, but if the buffer is 4 KB and all you want is to safely copy 20 characters into it, the extra 4075 nulls are serious overkill, and can easily lead to quadratic behaviour if you are repeatedly adding material to a long buffer.)
There are already open source implementations like strlcpy that do safe copying.
http://en.wikipedia.org/wiki/Strlcpy
In the references there are links to the sources.
strncpy() is safer against stack overflow attacks by the user of your program; it doesn't protect you against errors you, the programmer, make, such as printing a non-null-terminated string, the way you've described.
You can avoid crashing from the problem you've described by limiting the number of chars printed by printf:
char my_string[10];
//other code here
printf("%.9s",my_string); //limit the number of chars to be printed to 9
Some new alternatives are specified in ISO/IEC TR 24731 (Check https://buildsecurityin.us-cert.gov/daisy/bsi/articles/knowledge/coding/317-BSI.html for info). Most of these functions take an additional parameter that specifies the maximum length of the target variable, ensure that all strings are null-terminated, and have names that end in _s (for "safe"?) to differentiate them from their earlier "unsafe" versions.[1]
Unfortunately, they're still gaining support and may not be available with your particular tool set. Later versions of Visual Studio will throw warnings if you use the old unsafe functions.
If your tools don't support the new functions, it should be fairly easy to create your own wrappers for the old functions. Here's an example:
errCode_t strncpy_safe(char *sDst, size_t lenDst,
                       const char *sSrc, size_t count)
{
    // No NULLs allowed.
    if (sDst == NULL || sSrc == NULL)
        return ERR_INVALID_ARGUMENT;

    // Validate buffer space.
    if (count >= lenDst)
        return ERR_BUFFER_OVERFLOW;

    // Copy and always null-terminate.
    memcpy(sDst, sSrc, count);
    *(sDst + count) = '\0';

    return OK;
}
You can change the function to suit your needs, for example, to always copy as much of the string as possible without overflowing. In fact, the VC++ implementation can do this if you pass _TRUNCATE as the count.
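A truncating variant might look like the sketch below; it mirrors the wrapper above, and the name and behaviour (silently truncating rather than returning an error) are assumptions for this example:

// Like strncpy_safe(), but truncates instead of failing when the source
// does not fit.  Requires <string.h> for strlen()/memcpy().
errCode_t strncpy_trunc(char *sDst, size_t lenDst, const char *sSrc)
{
    // No NULLs allowed, and the destination must hold at least the terminator.
    if (sDst == NULL || sSrc == NULL || lenDst == 0)
        return ERR_INVALID_ARGUMENT;

    size_t n = strlen(sSrc);
    if (n >= lenDst)
        n = lenDst - 1;          // keep room for the terminator

    memcpy(sDst, sSrc, n);
    sDst[n] = '\0';
    return OK;
}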
[1] Of course, you still need to be accurate about the size of the target buffer: if you supply a 3-character buffer but tell strcpy_s() it has space for 25 chars, you're still in trouble.
Use strlcpy(), specified here: http://www.courtesan.com/todd/papers/strlcpy.html
If your libc doesn't have an implementation, then try this one:
size_t strlcpy(char *dst, const char *src, size_t bufsize)
{
    size_t srclen = strlen(src);
    size_t result = srclen; /* Result is always the length of the src string */

    if (bufsize > 0)
    {
        if (srclen >= bufsize)
            srclen = bufsize - 1;
        if (srclen > 0)
            memcpy(dst, src, srclen);
        dst[srclen] = '\0';
    }
    return result;
}
(Written by me in 2004 - dedicated to the public domain.)
Instead of strncpy(), you could use
snprintf(buffer, BUFFER_SIZE, "%s", src);
Here's a one-liner which copies at most size-1 non-null characters from src to dest and adds a null terminator:
static inline void cpystr(char *dest, const char *src, size_t size)
{ if(size) while((*dest++ = --size ? *src++ : 0)); }
strncpy() works directly with the string buffers available; if you are working directly with your memory, you MUST know the buffer sizes, and you can set the '\0' manually.
I believe there is no better alternative in plain C, but it's not really that bad if you are as careful as you should be when playing with raw memory.
Without relying on newer extensions, I have done something like this in the past:
/* copy N "visible" chars, adding a null in the position just beyond them */
#define MSTRNCPY( dst, src, len) ( strncpy( (dst), (src), (len)), (dst)[ (len) ] = '\0')
and perhaps even:
/* pull up to size - 1 "visible" characters into a fixed size buffer of known size */
#define MFBCPY( dst, src) MSTRNCPY( (dst), (src), sizeof( dst) - 1)
Why the macros instead of newer "built-in" (?) functions? Because there used to be quite a few different unices, as well as other non-unix (non-windows) environments that I had to port to back when I was doing C on a daily basis.
I have always preferred:
memset(dest, 0, LEN);
strncpy(dest, src, LEN - 1);
to the fix it up afterwards approach, but that is really just a matter of preference.
These functions have evolved more than being designed, so there really is no "why". You just have to learn "how". Unfortunately, the Linux man pages at least are devoid of common use-case examples for these functions, and I've noticed lots of misuse in code I've reviewed. I've made some notes here:
http://www.pixelbeat.org/programming/gcc/string_buffers.html