Which string manipulation functions should I use? - c

On my Windows/Visual C environment there's a wide number of alternatives for doing the same basic string manipulation tasks.
For example, for doing a string copy I could use:
strcpy, the ANSI C standard library function (CRT)
lstrcpy, the version included in kernel32.dll
StrCpy, from the Shell Lightweight Utility library
StringCchCopy/StringCbCopy, from a "safe string" library
strcpy_s, security enhanced version of CRT
While I understand that all these alternatives have an historical reason, can I just choose a consistent set of functions for new code? And which one? Or should I choose the most appropriate function case by case?

First of all, let's review pros and cons of each function set:
ANSI C standard library function (CRT)
Functions like strcpy are the one and only choice if you are developing portable C code. Even in a Windows-only project, it might be a wise thing to have a separation of portable vs. OS-dependent code.
These functions often have assembly-level optimizations and are therefore very fast.
There are some drawbacks:
they have many limitations and therefore often you still have to call functions from other libraries or provide your own versions
there are some archaisms like the infamous strncpy
Kernel32 string functions
Functions like lstrcpy are exported by kernel32 and should be used only when trying to avoid any dependency to the CRT. You might want to do that for two reasons:
avoiding the CRT payload for an ultra lightweight executable (unusual these days but not in the 90s!)
avoiding initialization issues (if you launch a thread with CreateThread instead of _beginthread).
Moreover, the kernel32 function could be more optimized than the CRT version: when your executable runs on Windows 12 optimized for a Core i13, kernel32 could use an assembly-optimized version.
Shell Lightweight Utility Functions
The same considerations made for the kernel32 functions apply here, with the added value of some more complex functions. However, I doubt that they are actively maintained and I would just skip them.
StrSafe Function
The StringCchCopy/StringCbCopy functions are usually my personal choice: they are very well designed, powerful, and surprisingly fast (I also remember a whitepaper that compared performance of these functions to the CRT equivalents).
Security-Enhanced CRT functions
These functions have the undoubted benefit of being very similar to ANSI C equivalents, so porting legacy code is a piece of cake. I especially like the template-based version (of course, available only when compiling as C++). I really hope that they will be eventually standardized. Unfortunately they have a number of drawbacks:
although a proposed standard, they have been basically rejected by the non-Windows community (probably just because they came from Microsoft)
when they fail, they don't just return an error code but execute an invalid parameter handler
Conclusions
While my personal favorite for Windows development is the StrSafe library, my advice is to use the ANSI C functions whenever possible, as portable code is always a good thing.
In real life, I developed a personalized portable library, with prototypes similar to the Security-Enhanced CRT functions (including the powerful template-based technique), that relies on the StrSafe library on Windows and on the ANSI C functions on other platforms.

My personal preference, for both new and existing projects, is the StringCchCopy/StringCbCopy versions from the safe string library. I find these functions to be overall very consistent and flexible. And they were designed from the ground up with safety / security in mind.

I'd answer this question slightly differently. Do you want to have portable code or not? If you want to be portable, you cannot rely on anything but strcpy, strncpy, or the standard wide character "string" handling functions.
Then if your code just has to run under Windows you can use the "safe string" variants.
If you want to be portable and still want some extra safety, then you should look at cross-platform libraries such as
glib or
libapr
or other "safe string libraries" like e.g:
SafeStrLibrary

I would suggest using functions from the standard library, or functions from cross-platform libraries.

I would stick to one, I would pick whichever one is in the most useful library in case you need to use more of it, and I would stay away from the kernel32.dll one as it's windows only.
But these are just tips, it's a subjective question.

Among those choices, I would simply use strcpy. At least strcpy_s and lstrcpy are cruft that should never be used. It's possibly worthwhile to investigate those independently written library functions, but I'd be hesitant to throw around nonstandard library code as a panacea for string safety.
If you're using strcpy, you need to be sure your string fits in the destination buffer. If you just allocated it with size at least strlen(source)+1, you're fine as long as the source string is not simultaneously subject to modification by another thread. Otherwise you need to test if it fits in the buffer. You can use interfaces like snprintf or strlcpy (nonstandard BSD function, but easy to copy an implementation) which will truncate strings that don't fit in your destination buffer, but then you really need to evaluate whether string truncation could lead to vulnerabilities in itself. I think a much better approach when testing whether the source string fits is to make a new allocation or return an error status rather than performing blind truncation.
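To make the "return an error status rather than truncate" option concrete, here is a minimal sketch. The name copy_if_fits is made up for illustration; it is not a standard or Windows function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard function): copies src into dst only
 * if the whole string, including the terminator, fits in dstsize bytes.
 * Returns 0 on success, -1 if it does not fit; dst is left untouched then. */
static int copy_if_fits(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);
    if (len >= dstsize)
        return -1;              /* no room for len characters plus '\0' */
    memcpy(dst, src, len + 1);  /* the +1 copies the terminator too */
    return 0;
}
```

A caller that gets -1 can then decide whether to allocate a larger buffer or report the failure, instead of silently working with a shortened string.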
If you'll be doing a lot of string concatenation/assembly, you really should write all your code to manage the length and current position as you go. Instead of:
strcpy(out, str1);
strcat(out, str2);
strcat(out, str3);
...
You should be doing something like:
size_t l, n = outsize;
char *s = out;
l = strlen(str1);
if (l>=n) goto error;
strcpy(s, str1);
s += l;
n -= l;
l = strlen(str2);
if (l>=n) goto error;
strcpy(s, str2);
s += l;
n -= l;
...
Alternatively you could avoid modifying the pointer by keeping a current index i of type size_t and using out+i, or you could avoid the use of size variables by keeping a pointer to the end of the buffer and doing things like if (l>=end-s) goto error;.
Note that, whichever approach you choose, the redundancy can be condensed by writing your own (simple) functions that take pointers to the position/size variable and call the standard library, for instance something like:
if (!my_strcpy(&s, &n, str1)) goto error;
Avoiding strcat also has performance benefits; see Schlemiel the Painter's algorithm.
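One possible shape for a helper like the my_strcpy mentioned above is sketched here. The exact signature and return convention are assumptions, chosen only to match the call site `my_strcpy(&s, &n, str1)`.

```c
#include <stddef.h>
#include <string.h>

/* One possible my_strcpy (an illustration, not a library function).
 * *pos is the current write position; *left is the number of bytes still
 * available there, including room for the '\0'.
 * Returns 1 on success and advances both; returns 0 if src does not fit. */
static int my_strcpy(char **pos, size_t *left, const char *src)
{
    size_t len = strlen(src);
    if (len >= *left)
        return 0;
    memcpy(*pos, src, len + 1);
    *pos += len;    /* now points at the '\0', where the next piece starts */
    *left -= len;
    return 1;
}
```

Because the position is carried along, each append costs only the length of the piece being added, which is also what avoids the repeated rescans that strcat performs.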
Finally, you should note that a good 75% of the string copying and assembly people perform in C is utterly useless. My theory is that the people doing it come from backgrounds in script languages where putting together strings is what you do all the time, but in C it's not useful that often. In many cases, you can get by with never copying strings at all, using the original copies instead, and get much better performance and simpler code at the same time. I'm reminded of a recent SO question where OP was using regexec to match a regular expression, then copying out the result just to print it, something like:
char *tmp = malloc(match.end-match.start+1);
memcpy(tmp, src+match.start, match.end-match.start);
tmp[match.end-match.start] = 0;
printf("%s\n", tmp);
free(tmp);
The same thing can be accomplished with:
printf("%.*s\n", (int)(match.end-match.start), src+match.start);
No allocations, no cleanup, no error cases (the original code crashed if malloc failed).
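For reference, a self-contained version of the no-copy approach might look like the sketch below. It assumes a POSIX system with regex.h, where the match offsets live in the rm_so and rm_eo fields of regmatch_t; the function name print_first_match is made up for illustration.

```c
#include <regex.h>
#include <stdio.h>

/* Prints the first match of pattern in src without copying it.
 * Returns the matched length, or -1 on no match or a bad pattern. */
static int print_first_match(const char *pattern, const char *src)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, src, 1, &m, 0) == 0) {
        /* The precision argument of %.*s must be an int, hence the cast. */
        len = (int)(m.rm_eo - m.rm_so);
        printf("%.*s\n", len, src + m.rm_so);
    }
    regfree(&re);
    return len;
}
```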

Related

Why there is so many functions in string.h library that are "not recommended for use"?

There is something I am trying to understand about C's origins: why are there functions that most SO answers recommend against using? Functions like strtok or strncpy are simply not safe to work with. Everywhere I see recommendations to write my own implementation. Why wouldn't the standard replace strncpy, for example, with BSD's strlcpy, instead of keeping these "monsters"?
C is a product of the early 1970s, and it shows. Many of the iffier library functions were written when the C user community was very small and limited to academia, most of whom were experienced programmers.
By the time the first standard was released in 1989, those original library functions were already entrenched in 10 to 15 years' worth of legacy code (not the least of which was the Unix operating system and most of its tools). The committee in charge of standardization was loath to break the existing codebase, so those functions were incorporated into the standard pretty much as-is; all that really changed was adding prototype syntax to the declarations and changing char * to void * where necessary (malloc, memcpy, memset, etc.).
AFAIK, only one library function has actually been removed from the language since standardization - gets. The mayhem caused by that one library call is scarier than the prospect of breaking what is by now almost 40 years' worth of legacy code.
There is a LOT of legacy "C" and "C++" code out there. If they removed all the "unsafe" functions from the "C" runtime libraries, it would be prohibitive for many developers to upgrade their compilers because all the old code wouldn't build any more.
Sometimes they will give "deprecated" compiler messages (MSFT is fond of this) so you will find and change to using the new, safer functions.
New code should use the "safe" functions, of course, but many of us are stuck with old compilers and legacy code to maintain :)
They still exist because of historical ancestral relationship with the "old system" / "codes" that still use them - i.e. to support "Backward Compatibility"
Writing your own implementation is suggested so that programmers use their own logic at their own risk, since no one knows their environment better than the programmers themselves; for example, strtok is not thread safe.
It's all just dogma. Use the functions; just be aware that they're indifferent to your goals, in that they might not work in all circumstances (e.g. strtok and multi-threading) or they expect conditions to be handled before/after usage (e.g. strncpy and missing termination characters).
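As an illustration of the strncpy case: when the source is at least as long as the count, strncpy writes no terminator, so the caller has to add one. A minimal sketch, with a made-up wrapper name:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: strncpy plus the explicit terminator it can omit.
 * It truncates silently, so the caller must decide whether that is
 * acceptable for the data at hand. */
static void copy_truncating(char *dst, size_t dstsize, const char *src)
{
    if (dstsize == 0)
        return;
    strncpy(dst, src, dstsize - 1);  /* may leave dst unterminated ... */
    dst[dstsize - 1] = '\0';         /* ... so terminate it explicitly */
}
```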

Is glib usable in an unobtrusive way?

I was looking for a good general-purpose library for C on top of the standard C library, and have seen several suggestions to use glib. How 'obtrusive' is it in your code? To explain what I mean by obtrusiveness, the first thing I noticed in the reference manual is the basic types section, thinking to myself, "what, am I going to start using gint, gchar, and gprefixing geverything gin gmy gcode gnow?"
More generally, can you use it only locally without other functions or files in your code having to be aware of its use? Does it force certain assumptions on your code, or constraints on your compilation/linking process? Does it take up a lot of memory in runtime for global data structures? etc.
The most obtrusive thing about glib is that any program or library using it is non-robust against resource exhaustion. It unconditionally calls abort when malloc fails and there's nothing you can do to fix this, as the entire library is designed around the concept that their internal allocation function g_malloc "can't fail".
As for the ugly "g" types, you definitely don't need any casts. The types are 100% equivalent to the standard types, and are basically just cruft from the early (mis)design of glib. Unfortunately the glib developers lack much understanding of C, as evidenced by this FAQ:
Why use g_print, g_malloc, g_strdup and fellow glib functions?
"Regarding g_malloc(), g_free() and siblings, these functions are much safer than their libc equivalents. For example, g_free() just returns if called with NULL."
(Source: https://developer.gnome.org/gtk-faq/stable/x908.html)
FYI, free(NULL) is perfectly valid C, and does the exact same thing: it just returns.
I have used GLib professionally for over 6 years, and have nothing but praise for it. It is very light-weight, with lots of great utilities like lists, hashtables, rand-functions, io-libraries, threads/mutexes/conditionals and even GObject. All done in a portable way. In fact, we have compiled the same GLib-code on Windows, OSX, Linux, Solaris, iOS, Android and Arm-Linux without any hiccups on the GLib side.
In terms of obtrusiveness, I have definitely "bought into the g", and there is no doubt in my mind that this has been extremely beneficial in producing stable, portable code at great speed. Maybe specially when it comes to writing advanced tests.
And if g_malloc doesn't suit your purpose, simply use malloc instead, which of course goes for all of it.
Of course you can "forget about it elsewhere", unless those other places somehow interact with glib code; then there's a connection (and, arguably, you're not really "elsewhere").
You don't have to use the types that are just regular types with a prepended g (gchar, gint and so on); they're guaranteed to be the same as char, int and so on. You never need to cast to/from gint for instance.
I think the intention is that application code should never use gint, it's just included so that the glib code can be more consistent.

Should I use secure versions of POSIX functions on MSVC - C

I am writing some C code which is expected to compile on multiple compilers (at least on MSVC and GCC). Since I am a beginner in C, I have all warnings turned on and warnings are treated as errors (-Werror in GCC & /WX in MSVC) to prevent me from making silly mistakes.
When I compiled some code that uses strcpy on MSVC, I got a warning like:
warning C4996: 'strcpy': This function or variable may be unsafe. Consider using strcpy_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
I am a bit confused. A lot of common functions are deprecated on MSVC. Should I use the secure versions when on Windows? If yes, should I wrap strcpy in something like this:
my_strcpy()
{
#ifdef _WIN32
// use strcpy_s
#else
// use strcpy
#endif
}
Any thoughts?
Whenever you move data between non-constant-size buffers, you have to (gasp! omg!) actually think about whether it fits. Using functions (like the MS-specific strcpy_s or the BSD strlcpy) that purport to be "safe" will protect you from some obvious buffer overflow conditions, but won't protect you from the bugs that result from string truncation. It also won't protect you from integer overflows in computing the necessary sizes of buffers.
Unless you're an expert dealing with C strings, I would recommend forgetting about special functions and commenting every line of your code that will perform variable-length/position writes with a justification for how you know, at this point in the program, that the length/offset you're about to use is within the bounds of the size of the buffer. Do this for lines where you perform arithmetic on sizes/offsets too - document how you know that the arithmetic will not overflow, and add tests for overflow if you find you don't know.
Another approach is to completely wrap all your string handling in a string object that stores the length of the buffer along with the string and automatically reallocates when a string needs to be enlarged, and then only use const char * for read-only access to strings when you need to pass them to system functions or other libraries. This will sacrifice a good bit of the performance you'd expect from C, but it will help you ensure that you don't make mistakes. Just don't take it to the extreme. There's no need to duplicate stuff like strchr, strstr, etc. in your string wrapper. Just provide methods to duplicate string objects, concatenate them, and truncate them, and then with the existing library functions that operate on const char * you can do just about anything you'd want to.
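A very small sketch of such a string object is shown below. The type and function names (str_t, str_init, str_append, str_cstr, str_free) are invented for this example, and it only covers initialization, appending, and read-only access. Appending a string to itself is not supported in this sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal string object: buffer, length and capacity together. */
typedef struct {
    char  *data;  /* always NUL-terminated once initialized */
    size_t len;   /* length excluding the terminator */
    size_t cap;   /* allocated bytes, including the terminator */
} str_t;

/* Returns 0 on success, -1 on allocation failure. */
static int str_init(str_t *s)
{
    s->data = malloc(1);
    if (!s->data)
        return -1;
    s->data[0] = '\0';
    s->len = 0;
    s->cap = 1;
    return 0;
}

/* Appends text, growing the buffer as needed.
 * Returns 0 on success, -1 on failure (s is left unchanged). */
static int str_append(str_t *s, const char *text)
{
    size_t add = strlen(text);
    if (add > SIZE_MAX - s->len - 1)
        return -1;                  /* size computation would overflow */
    size_t need = s->len + add + 1;
    if (need > s->cap) {
        size_t newcap = s->cap;
        while (newcap < need) {
            if (newcap > SIZE_MAX / 2) { newcap = need; break; }
            newcap *= 2;            /* grow geometrically */
        }
        char *p = realloc(s->data, newcap);
        if (!p)
            return -1;
        s->data = p;
        s->cap = newcap;
    }
    memcpy(s->data + s->len, text, add + 1);
    s->len += add;
    return 0;
}

/* Read-only view for passing to functions that take const char *. */
static const char *str_cstr(const str_t *s) { return s->data; }

static void str_free(str_t *s)
{
    free(s->data);
    s->data = NULL;
    s->len = s->cap = 0;
}
```

Functions such as strchr and strstr can then be applied directly to the pointer returned by str_cstr, so the wrapper does not need its own versions of them.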
There are lots and lots of discussions about this topic here on SO. The usual suspects like strncpy, strlcpy and whatever will pop up here again, I'm sure. Just type "strcpy" in the search box and read some of the longer threads to get an overview.
My advice is: Whatever your final choice will be, it is a good idea to follow the DRY principle and continue to do it as in your example of my_strcpy(). Don't throw the raw calls all over your code, use wrappers and centralize them in your own string handling library. This will reduce overall code (boilerplate), and you have one central location to make modifications, if you change your mind later.
Of course this opens up some other cans of worms, especially for a beginner: Memory handling responsibility and interface design. Both a topic on its own, and 5 people will give you 10 suggestions of how to do it. A central library usually has the nice effect that it enforces a decision, which you will follow throughout your whole codebase, instead of using method a in module A and method b in module B, causing you trouble when you try to connect A with B...
I would tend to use the safer function snprintf † which is available on both platforms rather than having different paths depending on platform. You will need to use the define to prevent the warnings on MSVC.
† though possibly slightly less safe: with some implementations (such as MSVC's older _snprintf) it can leave the string without a nul terminator when the output does not fit, so you must check the return value, but it won't cause a buffer overflow.
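Checking the return value could look roughly like this. The sketch assumes a C99-conforming snprintf, which returns the length the full output would have had; the helper name copy_snprintf is made up.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy src into dst via snprintf and report truncation.
 * Returns 0 if the whole string fit, -1 on truncation or an output error. */
static int copy_snprintf(char *dst, size_t dstsize, const char *src)
{
    int n = snprintf(dst, dstsize, "%s", src);
    if (n < 0 || (size_t)n >= dstsize)
        return -1;  /* error, or the output needed more than dstsize bytes */
    return 0;
}
```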

A pure bytes version of strstr?

Is there a version of strstr that works over a fixed length of memory that may include null characters?
I could phrase my question like this:
strncpy is to memcpy as strstr is to ?
memmem, unfortunately it's GNU-specific rather than standard C. However, it's open-source so you can copy the code (if the license is amenable to you).
Not in the standard library (which is not that large, so take a look). However writing your own is trivial, either directly byte by byte or using memchr() followed by memcmp() iteratively.
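A sketch of the memchr() followed by memcmp() approach is below. It is named my_memmem here (a made-up name) so that it does not clash with a memmem the system may already provide.

```c
#include <stddef.h>
#include <string.h>

/* Finds the first occurrence of needle (needlelen bytes) in hay (haylen
 * bytes). Both may contain '\0' bytes. Returns a pointer into hay or NULL. */
static void *my_memmem(const void *hay, size_t haylen,
                       const void *needle, size_t needlelen)
{
    const unsigned char *h = hay;
    const unsigned char *n = needle;

    if (needlelen == 0)
        return (void *)h;
    while (haylen >= needlelen) {
        /* Only search start positions where a full needle could still fit. */
        const unsigned char *p = memchr(h, n[0], haylen - needlelen + 1);
        if (!p)
            return NULL;
        if (memcmp(p, n, needlelen) == 0)
            return (void *)p;
        haylen -= (size_t)(p - h) + 1;  /* skip past this candidate */
        h = p + 1;
    }
    return NULL;
}
```

This is the simple quadratic-time version; it is fine for short needles, though tuned library implementations typically use faster search algorithms.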
In the standard library, no. However, a quick google search for "safe c string library" turns up several potentially useful results. Without knowing more about the task you are trying to perform, I cannot recommend any particular third-party implementation.
If this is the only "safe" function that you need beyond the standard functions, then it may be best to roll your own rather than expend the effort of integrating a third-party library, provided you are confident that you can do so without introducing additional bugs.

Safer Alternatives to the C Standard Library

The C standard library is notoriously poor when it comes to I/O safety. Many functions have buffer overflows (gets, scanf), or can clobber memory if not given proper arguments (scanf), and so on. Every once in a while, I come across an enterprising hacker who has written his own library that lacks these flaws.
What are the best of these libraries you have seen? Have you used them in production code, and if so, which held up as more than hobby projects?
I use the GLib library; it has many good standard and non-standard functions.
See https://developer.gnome.org/glib/stable/
and maybe you fall in love... :)
For example:
https://developer.gnome.org/glib/stable/glib-String-Utility-Functions.html#g-strdup-printf
explains that g_strdup_printf is:
Similar to the standard C sprintf() function but safer, since it calculates the maximum space required and allocates memory to hold the result.
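The same "measure first, then allocate" idea can be sketched with only standard C99 functions. This is not GLib's implementation, just an illustration of the technique, and format_alloc is a made-up name.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of GLib or the C standard library.
 * Returns a malloc'd formatted string, or NULL on error; the caller frees. */
static char *format_alloc(const char *fmt, ...)
{
    va_list ap, ap2;
    char *buf = NULL;
    int n;

    va_start(ap, fmt);
    va_copy(ap2, ap);                   /* the list is consumed twice */
    n = vsnprintf(NULL, 0, fmt, ap);    /* first pass: measure only */
    va_end(ap);

    if (n >= 0) {
        buf = malloc((size_t)n + 1);
        if (buf)
            vsnprintf(buf, (size_t)n + 1, fmt, ap2);  /* second pass: write */
    }
    va_end(ap2);
    return buf;
}
```

Unlike g_strdup_printf, this version reports allocation failure by returning NULL, so the caller has to check the result.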
This isn't really answering your question about the safest libraries to use, but most functions that are vulnerable to buffer overflows that you mentioned have safer versions which take the buffer length as an argument to prevent the security holes that are opened up when the standard methods are used.
Unless you have relaxed the level of warnings, you will usually get compiler warnings when you use the deprecated methods, suggesting you use the safer methods instead.
I believe the Apache Portable Runtime (apr) library is safer than the standard C library. I use it, well, as part of an apache module, but also for independent processes.
For Windows there is a 'safe' C/C++ library.
You're always at liberty to implement any library you like and to use it - the hard part is making sure it is available on the platforms you need your software to work on. You can also use wrappers around the standard functions where appropriate.
Whether it is really a good idea is somewhat debatable, but there is TR24731 published by the C standard committee - for a safer set of C functions. There's definitely some good stuff in there. See this question: Do you use the TR 24731 Safe Functions in your C code?, which includes links to the technical report.
Maybe the first question to ask is whether you really need plain C (maybe a .NET language or Java is an option; then buffer overflows, for example, are no longer really a problem).
Another option is maybe to write parts of your project in C++ if other higher level languages are not an option. You can then have a C interface which encapsulates the C++ code if you really need C.
Because if you add all the advanced functionality the C++ standard library has built in, your C code would usually be only marginally faster (and contain a lot more bugs than an existing, tested framework).
