How to safely use C strings in C embedded applications

After realizing that the String type (with a capital S) on Arduino is a big source of trouble (cf. https://hackingmajenkoblog.wordpress.com/2016/02/04/the-evils-of-arduino-strings/), I am trying to use C strings instead, to be safer and more robust in embedded applications.
However, I am facing some issues regarding safety. To illustrate my problem, let's take an MD5 hashing function that receives a string message, concatenates a private key, and then computes and returns the hash as a string.
I came to this function:
#define MD5_PRIVATE_KEY "my_private_key"

void ComputeMd5(const char* msg, char* hashBuffer, uint8_t hashBufferSize)
{
    if( (hashBuffer == NULL) || (hashBufferSize == 0) )
    {
        /* INVALID ARG */
        *hashBuffer = '\0';
        return;
    }

    if(hashBufferSize <= HASH_SIZE)
    {
        /* SIZE ERROR */
        *hashBuffer = '\0';
        return;
    }

    uint16_t toHashSize = strlen(msg) + strlen(MD5_PRIVATE_KEY) + 1;
    char toHash[toHashSize] = "";
    strcat(toHash, MD5_PRIVATE_KEY);
    strcat(toHash, msg);

    strncpy(hashBuffer, MD5(toHash, HASH_SIZE), hashBufferSize);
}
With this function, the calls to strlen(msg) and strcat(toHash, msg) are not safe: we don't know the length of msg (so we cannot use strnlen() and strncat() instead), and we don't even know whether msg is a valid null-terminated string.
My question is: would it be good practice to add the length of msg to the prototype, e.g. void ComputeMd5(const char* msg, uint16_t msgSize, char* hashBuffer, uint8_t hashBufferSize), in order to use the 'n' versions of strlen and strcat? And is it acceptable to rely on the caller to provide a valid null-terminated string, or is it the responsibility of this function to check (and if so, how)?
Maybe there is a completely different design for this that I don't know about (but I still want to avoid dynamic allocation, since it is considered unsafe for embedded applications).
Sorry if this isn't very clear; it is still a bit confused in my head. I'm looking for a discussion about best practices for using C strings in the safest way.
Thanks.

Arduino is not C, only C++.
There is no 100% safe way of dealing with pointers and arrays (including null character terminated char arrays called C strings).
Your attempt to make them "safe" is extremely unsafe.
1. This check:

   if( (hashBuffer == NULL) || (hashBufferSize == 0) )
   {
       /* INVALID ARG */
       *hashBuffer = '\0';
       return;
   }

   If hashBuffer == NULL, then *hashBuffer = '\0' is undefined behaviour.

2. You do not check whether msg is NULL.

3. You do not check whether toHashSize is a reasonable size.

4. strncpy is not a safe function. If the MD5 function returns a string longer than or equal in length to hashBufferSize, the result will not be null-terminated.

You may have a problem even if the parameters are perfectly valid. For example:
uint16_t toHashSize = strlen(msg) + strlen(MD5_PRIVATE_KEY) + 1;
char toHash[toHashSize] = "";
VLAs may be (and, to the best of my knowledge, usually are) allocated on the stack. How much stack you have, and whether this VLA will fit for a particular msg, is something you neither know nor check (ignoring for a second the initialization issue: a VLA cannot have an initializer). So in your attempt to avoid one vulnerability you have just created a new one.
So yes, if you add a length to the parameters you could validate that msg is in fact a null-terminated string of that length, but that assumes you can trust the caller to provide the correct length. And if you trust the caller to provide a correct length, you should be able to trust them to provide a null-terminated string in the first place, shouldn't you?
Or just don't treat it as a string at all, but rather as an arbitrary binary buffer that you are hashing, in which case a length parameter is required and a \0 terminator is not. I don't know which MD5 implementation you are using, but you can hash an arbitrary binary buffer; it doesn't have to be a string (your MD5 call seems to expect a string).
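For what it's worth, here is a minimal sketch of that binary-buffer approach. It assumes (hypothetically) that your MD5 library offers an incremental init/update/final interface; the md5_ctx type and the md5_* functions are placeholder names for whatever your actual library provides, and HASH_SIZE / MD5_PRIVATE_KEY are taken from the question.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: md5_ctx, md5_init, md5_update and md5_final stand in for the
 * incremental interface of whichever MD5 library is actually used. */
void ComputeMd5(const uint8_t *msg, size_t msgLen,
                char *hashBuffer, size_t hashBufferSize)
{
    if ((msg == NULL) || (hashBuffer == NULL) || (hashBufferSize <= HASH_SIZE))
    {
        return; /* report the error properly in real code */
    }

    md5_ctx ctx;
    md5_init(&ctx);
    md5_update(&ctx, (const uint8_t *)MD5_PRIVATE_KEY, strlen(MD5_PRIVATE_KEY));
    md5_update(&ctx, msg, msgLen);  /* no null terminator required */
    md5_final(&ctx, hashBuffer);    /* assumed to write the hash string */
}

Feeding the key and the message to the hash separately also removes the need for the concatenation buffer, and therefore the VLA, entirely.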

The way to deal with error handling in C (or in any sane program) is to place the error handling as close to the error source as possible. If you suspect that strings are too large or not null terminated for whatever reason, then you need to write code ensuring that they are valid at the point where you receive the strings. Not in some completely unrelated hash function!
Similarly, if you suspect that some pointer is set to NULL for reasons unknown, you need to deal with that at the point where it might occur, not inside some completely unrelated function.
It is also bad practice to place checks like if( (hashBuffer == NULL) || (hashBufferSize == 0) ) inside functions, because it slows down the normal use-case where the caller is passing on perfectly valid data.
In case you do implement a function with extensive error handling, you should let it return a proper error code, as an enum. Not by silently setting data to something like *hashBuffer = '\0'; and then merrily moving on with execution. That's not error handling, that's error hiding.
char toHash[toHashSize] = ""; Using a VLA is usually not a great idea in embedded systems that might be tight on stack space. If you are sure that this function isn't called often and that your stack has margins, then sure, you can do things like this. But it would make far more sense to do static char toHash[MAX_SIZE] = ""; since your function always needs to be able to handle the worst-case scenario anyway.
Although in this specific case, what makes the most sense is to do caller allocation. There's no obvious reason why you need to create a local temporary buffer, it just chews up memory and performance. Just access hashBuffer directly without any middle man.
strncpy is a dangerous function that should never be used in C programs, because it wasn't designed to be used with C strings. Detailed explanation here: Is strcpy dangerous and what should be used instead?
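For illustration, here is one way the question's function could be reshaped along those lines. It is only a sketch under a few assumptions: MAX_MSG_SIZE is a hypothetical application-wide limit, MD5() is the question's own hashing call, and strnlen() must be available on your toolchain (it is POSIX/C23, not C99).

typedef enum
{
    MD5_OK,
    MD5_ERR_NULL_ARG,
    MD5_ERR_BUFFER_TOO_SMALL,
    MD5_ERR_MSG_TOO_LONG
} Md5Status;

Md5Status ComputeMd5(const char *msg, char *hashBuffer, size_t hashBufferSize)
{
    if ((msg == NULL) || (hashBuffer == NULL))
    {
        return MD5_ERR_NULL_ARG;
    }
    if (hashBufferSize <= HASH_SIZE)
    {
        return MD5_ERR_BUFFER_TOO_SMALL;
    }

    /* Fixed worst-case buffer instead of a VLA; note that the static
     * storage makes this function non-reentrant. */
    static char toHash[sizeof(MD5_PRIVATE_KEY) + MAX_MSG_SIZE];

    size_t msgLen = strnlen(msg, MAX_MSG_SIZE + 1);
    if (msgLen > MAX_MSG_SIZE)
    {
        return MD5_ERR_MSG_TOO_LONG;
    }

    memcpy(toHash, MD5_PRIVATE_KEY, sizeof(MD5_PRIVATE_KEY) - 1);
    memcpy(toHash + sizeof(MD5_PRIVATE_KEY) - 1, msg, msgLen + 1);

    /* MD5() is assumed, as in the question, to return a null-terminated
     * hash string of HASH_SIZE characters. */
    strcpy(hashBuffer, MD5(toHash, HASH_SIZE));
    return MD5_OK;
}

The static buffer trades reentrancy for predictable memory use; if the function may be called from an ISR, make the buffer a fixed-size automatic array instead.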


Returning a pointer to a static buffer

In C on a small embedded system, is there any reason not to do this:
const char * filter_something(const char * original, const int max_length)
{
    static char buffer[BUFFER_SIZE];

    // checking inputs for safety omitted
    // copy input to buffer here with appropriate filtering etc

    return buffer;
}
This is essentially a utility function. The source is FLASH memory, which may be corrupted, so we do a kind of "safe copy" to make sure we have a null-terminated string. I chose to use a static buffer and make it available read-only to the caller.
A colleague is telling me that I am somehow not respecting the scope of the buffer by doing this, to me it makes perfect sense for the use case we have.
I really do not see any reason not to do this. Can anyone give me one?
(LATER EDIT)
Many thanks to all who responded. You have generally confirmed my ideas on this, which I am grateful for. I was looking for major reasons not to do this, I don't think that there are any. To clarify a few points:
reentrancy/thread safety is not a concern. It is a small (bare-metal) embedded system with a single run loop. This code will not be called from ISRs, ever.
in this system we are not short on memory, but we do want very predictable behavior. For this reason I prefer declaring an object like this statically, even though it might be a little "wasteful". We have already had issues with large objects declared carelessly on the stack, which caused intermittent crashes (now fixed, but it took a while to diagnose). So in general I prefer static allocation, simply for predictability, reliability, and fewer potential issues downstream.
So basically it's a case of taking a certain approach for a specific system design.
Pro
The behavior is well defined; the static buffer exists for the duration of the program and may be used by the program after filter_something returns.
Cons
Returning a static buffer is prone to error because people writing calls to the routines may neglect or be unaware that a static buffer is returned. This can lead to attempts to use multiple instances of the buffer from multiple calls to the function (in the same thread or different threads). Clear documentation is essential.
The static buffer exists for the duration of the program, so it occupies space at times when it may not be needed.
It really depends on how filter_something is used. Take the following as an example
#include <stdio.h>
#include <string.h>

const char* filter(const char* original, const int max_length)
{
    static char buffer[1024];
    memset(buffer, 0, sizeof(buffer));
    memcpy(buffer, original, max_length);
    return buffer;
}

int main()
{
    const char *strone, *strtwo;
    char deepone[16], deeptwo[16];

    /* Case 1 */
    printf("%s\n", filter("everybody", 10));

    /* Case 2 */
    printf("%s %s %s\n", filter("nobody", 7), filter("somebody", 9), filter("anybody", 8));

    /* Case 3 */
    if (strcmp(filter("same", 5), filter("different", 10)) == 0)
        printf("Strings same\n");
    else
        printf("Strings different\n");

    /* Case 4 - Both of these end up with the same pointer */
    strone = filter("same", 5);
    strtwo = filter("different", 10);
    if (strcmp(strone, strtwo) == 0)
        printf("Strings same\n");
    else
        printf("Strings different\n");

    /* Case 5 - You need a deep copy if you wish to compare */
    strcpy(deepone, filter("same", 5));
    strcpy(deeptwo, filter("different", 10));
    if (strcmp(deepone, deeptwo) == 0)
        printf("Strings same\n");
    else
        printf("Strings different\n");
}
The output when gcc is used is
everybody
nobody nobody nobody
Strings same
Strings same
Strings different
When filter is used by itself, it behaves quite well.
When it is used multiple times in an expression, the result is unpredictable: there is no telling what it will do. All instances will show the contents from whichever call to filter was executed last, and that depends on the order in which the arguments were evaluated.
If a pointer to the result is saved, the contents it points to will not stay the same as they were when the pointer was taken. This is also a common problem when C++ coders switch to C# or Java.
If a deep copy of the result is taken, then the contents at the time the copy was made are preserved.
In C++, this technique is often used when returning objects with the same consequences.
It is true that the identifier buffer only has scope local to the block in which it is declared. However, because it is declared static, its lifetime is that of the full program.
So returning a pointer to a static variable is valid. In fact, many standard functions do this such as strtok and ctime.
The one thing you need to watch for is that such a function is not reentrant. For example, if you do something like this:
printf("filter 1: %s, filter 2: %s\n",
filter_something("abc", 3), filter_something("xyz", 3));
The two function calls can occur in any order, and both return the same pointer, so you'll get the same result printed twice (i.e. the result of whatever call happens to occur last) instead of two different results.
Also, if such a function is called from two different threads, you end up with a race condition with the threads reading/writing the same place.
Just to add to the previous answers: I think the problem, in a more abstract sense, is that the filtering result is given broader scope than it ought to have. You introduce 'state' which seems useless, at least if the caller's intention is only to get a filtered string. In that case it should be the caller who creates the array, likely on the stack, and passes it as a parameter to the filtering function. It is the introduction of this state that makes possible all the problems referred to in the preceding responses.
From a program design standpoint, it's frowned upon to return pointers to private data, in case that data was made private for a reason. That being said, it's less bad design to return a pointer to a local static than it is to use spaghetti programming with "globals" (external linkage). Particularly when the pointer returned is const qualified.
One general issue with static variables, which may or may not be a problem regardless of whether the system is embedded or hosted, is re-entrancy. If the code needs to be interrupt/thread safe, then you need to implement means to achieve that.
The obvious alternative to it all is caller allocation and you've got to ask yourself why that's not an option:
void filter_something (size_t size, char dest[size], const char original[size]);
(Or if you will, [restrict size] on both pointers for a mini-optimization.)
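For illustration, a hypothetical call site with caller allocation might look like this; it assumes filter_something copies at most size-1 characters and always null-terminates, and the raw/clean buffer names are made up.

char raw[32];    /* e.g. a field copied from FLASH, possibly unterminated */
char clean[32];

/* ... fill 'raw' from FLASH ... */

filter_something(sizeof clean, clean, raw);
/* 'clean' is now a null-terminated copy owned by the caller, so two
 * successive calls no longer overwrite each other's result. */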

In C, what is a safer function to use than strtrns?

strtrns has the following descriptions: desc-1, desc-2
The strtrns() function transforms string and copies it into
result.
Any character that appears in old is replaced with the character in
the same position in new. The new result is returned. .........
This function is a security risk because it is possible to overflow
the newString buffer. If the currentString buffer is larger than the
newString buffer, then an overflow will occur.
And this is its prototype (or "signature"?):
char * strtrns(const char *string, const char *old, const char *new, char *result);
I've been googling to no avail. I appreciate any tips or advice.
I think you can write your own safe one pretty quickly.
It won't be a direct replacement, as the signature is slightly different, and it will allocate memory that the caller must free, but it can serve mostly the same job.
(I'm also changing the parameter name new, which is a reserved word in C++, and the parameter name string, which is a very common type name in C++. These changes make the function compatible with C++ code as well.)
char* alloc_strtrns(const char *srcstr, const char *oldtxt, const char *newtxt)
{
    if (strlen(oldtxt) != strlen(newtxt))
    {
        return NULL; /* Old and New lengths MUST match */
    }

    char* result = strdup(srcstr); /* TODO: check for NULL */

    /* Caller is responsible for freeing! */
    return strtrns(srcstr, oldtxt, newtxt, result);
}
The claim that this function is unsafe is nonsense. In C, whenever you have an interface that takes a pointer to a buffer and fills it with some amount of data, you must have a contract between the caller and callee regarding the buffer size. For some functions where the caller cannot know in advance how much data the callee will write, the most logical interface design (contract) is to have the caller pass the buffer size to the callee and have the callee return an error or truncate the data if the buffer is too small. But for functions like strcpy or in your case strtrns where the number of output bytes is a trivial function (like the identity function) of the number of input bytes, it makes perfectly good sense for the contract to simply be that the output buffer provided by the caller must be at least as large as the input buffer.
Anyone who is not comfortable with strict adherence to interface contracts should not be writing C. There is really no way around this; adding complex bounds-checking interfaces certainly does not solve the problem but just shifts around the nature of the contracts you have to follow.
By the way, strtrns is not a standard function, so if you'd prefer a different contract you might be better off writing your own similar function anyway. This would increase portability too.
You don't really have any options in C. You simply have to ensure that the destination buffer is large enough.
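If you do write your own, a bounded variant is straightforward. Here is one possible sketch; the name strtrns_s and the truncate-and-always-terminate contract are my own choices, not any standard API.

#include <stddef.h>
#include <string.h>

/* Translate 'src' into 'result', replacing characters found in 'from' with
 * the character at the same position in 'to'. Writes at most resultSize-1
 * characters, always null-terminates, and returns the number of characters
 * written (excluding the terminator). 'from' and 'to' are assumed to have
 * the same length, as with strtrns(). */
size_t strtrns_s(const char *src, const char *from, const char *to,
                 char *result, size_t resultSize)
{
    size_t i;

    if (src == NULL || from == NULL || to == NULL ||
        result == NULL || resultSize == 0)
    {
        return 0;
    }

    for (i = 0; i + 1 < resultSize && src[i] != '\0'; i++)
    {
        const char *hit = strchr(from, src[i]);
        /* Replace the character if it appears in 'from', keep it otherwise. */
        result[i] = (hit != NULL) ? to[hit - from] : src[i];
    }
    result[i] = '\0';
    return i;
}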

C: How to copy over null terminator to structure member, in cleaner way?

Essentially I am tokenizing a string and strncpying each found token to a structure member, i.e. stringID. It of course suffers from the lack-of-termination problem; I have added an extra array element for the terminator, but I've no clue how to set it properly.
I had done it like so:
my_struct[iteration].stringID[ID_SIZE-1] = '\0'; // updated
I am unsure whether that really works, and it looks horrible IMO.
Str(n)cpying a null character, or 0, results in a warning generated by GCC and MinGW:
warning: null argument where non-null required (arg 2)
Am I blind to how to do this in a clean manner? I was thinking of memsetting the member array to all zeros and then copying the string in, to fit nicely with null termination. Do you have any suggestions or best practices?
Two things:
Beware that strncpy() has very unexpected semantics, it will always 0-fill the buffer if not totally filled by the string, and it will not terminate the string if it completely fills the buffer. Both of these are weird enough that I recommend against using it.
Never index an array with its size, like stringID[ID_SIZE] seems to be doing; that is out of bounds.
The best solution is to write a custom version of strncpy() that is less weird, or (if you know the length of the input) just use strcpy().
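For what it's worth, one possible shape for such a "less weird" copy follows the BSD strlcpy() contract; the name bounded_copy is made up, and strlcpy itself is not part of standard C, though many libcs provide it.

#include <stddef.h>

/* Copies at most size-1 characters, always null-terminates (if size > 0),
 * and does not zero-fill the remainder of the buffer. Returns the length of
 * 'src' so the caller can detect truncation. */
size_t bounded_copy(char *dst, const char *src, size_t size)
{
    size_t srclen = 0;

    while (src[srclen] != '\0')
    {
        if (srclen + 1 < size)
        {
            dst[srclen] = src[srclen];
        }
        srclen++;
    }

    if (size > 0)
    {
        dst[srclen < size ? srclen : size - 1] = '\0';
    }
    return srclen;
}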
UPDATE: If the length of your input tokens is static, but they're not 0-terminated in the source buffer due to your tokenization process, then just use memcpy() and manual termination:
const char * token = ...; /* Extract from tokenization somehow. Not 0-terminated. */
const size_t token_length = ... /* Perhaps from tokenization step. */
memcpy(my_struct[iteration].stringID, token, token_length);
my_struct[iteration].stringID[token_length] = '\0';
I don't see a need to "wrap" the above in a macro.
Actually, null terminating the way you suggested isn't horrible at all and I personally very much like it.
The best way, in my opinion, would be to define it as a macro in similar fashion:
// for char* blah;
#define TERMINATE_DYNAMIC_STRING(str, len) str[len] = '\0';
// for char mytext[] = "hello";
#define TERMINATE_STRING(str) str[sizeof(str)/sizeof(str[0]) - 1] = '\0';
Then you can use it all around your code as much as you want.
On Windows Microsoft gives you the following functions which null terminate when copying string: StringCchCopy
As others have noted, strncpy has odd semantics. The idiomatic way to do a bounded string copy is to strncat onto an empty string:
my_struct[iteration].stringID[0] = '\0';
strncat(my_struct[iteration].stringID, src, ID_SIZE-1);
This always appends a terminating NUL (and writes at most ID_SIZE characters, including the NUL).
I ended up writing a strncpyz(char* pszTo, const char* pszFrom, size_t lSize) function that forces NUL termination. This works pretty well if you have a library to put it in. Using it also requires minimal code changes.
I'm not keen on the macro approach because somebody will pass a pointer to the wrong macro.

How to check for type in C in function

I'm making a function that takes two pointers to strings as arguments. It works fine, as long as you pass it valid arguments.
I want to know how to check that these pointers are valid and not, for example, two random ints. How do I do that?
char ** LCS ( char * s1, char * s2) //thats the function
...
LCS(0,0) //...awful crash.. How do I avoid it?
In the body of the function, check:
if ((s1==NULL) || (s2==NULL)) {
    /* Do something to indicate bad parameters */
}
With documentation and by following the C motto: "trust the programmer".
/* s1 and s2 must be both valid pointers to null-terminated strings
** otherwise the behaviour is undefined */
char ** LCS ( char * s1, char * s2);
Does it make sense for someone to call your function with NULL arguments? If not, you should disallow NULL arguments in the contract of your function, e.g. by adding a comment above the declaration saying that it only works on valid, non-NULL arguments. In other words, anyone who uses your function agrees not to give NULL arguments; it's then their responsibility to check against this, not yours.
If it does make sense for either or both of the arguments to be NULL, then you need to decide on how your function behaves in that case and implement it thus. In this case you are agreeing to support NULL arguments and do something sensible with them, and therefore it becomes your responsibility to check for this and act accordingly (e.g. if (s1 == NULL)).
If you cannot think of any sensible behaviour for NULL arguments, then go with the first option and disallow them altogether. If you do this, then your example call LCS(0,0); is in breach of contract (i.e. passes NULL pointers when the function does not agree to accept them) and should be removed. In a more complex scenario if you are passing the arguments from variables and there is a chance that those variables point to NULL, then you must check before calling LCS, e.g. if (v1 && v2) { LCS(v1,v2); } else { … }.
To track possible errors relating to this, you could use assert to check, e.g.:
#include <assert.h>
char **LCS (char *s1, char *s2) {
    assert(s1);
    assert(s2);
    …
}
This will cause your program to exit if s1 or s2 is NULL, unless NDEBUG was defined before including assert.h (in which case the assertions do nothing). So the assertions are a way to check, during development, that the caller is not giving you NULL arguments but it's still an error if they do.
As for other invalid pointers, you cannot really even check reliably, e.g. there's no way of knowing whether the caller has a really strange string or if they just passed the wrong address. This, too, is their responsibility to avoid, and LCS should simply assume that the caller is giving you valid data. Of course if you have additional restrictions, e.g. maximum length of the argument strings, then you must make these restrictions clear to the caller (i.e. specify the contract for the function, “this function does X [your responsibility as the implementor of LCS] provided that … [their responsibilities as the user of LCS]”). This applies to all programming, for example the C standard specifies how the language itself and the standard library functions must be used (e.g. cannot divide by zero, argument strings for strcpy cannot overlap, etc).
In C, I'm afraid you have to just be careful and hope the programmers know what to do.
In this case, 0 (zero, null, NULL) is a perfectly legal argument as far as the compiler is concerned.
Normally in that case, you would at least protect the function by checking whether the input is valid,
for example ...
char** LCS (char *s1, char *s2 )
{
    if ( s1 == 0 )
        return ...;
    if ( s2 == 0 )
        return ...;
    if ( strlen( s1 ) == 0 )
        return ...

    /// do something ...
}
The best you can do is check against NULL (0). Otherwise, there's no standard way to tell whether a non-NULL pointer value is valid. There may be some platform-specific hacks available, but in general this problem is dealt with by documentation and good memory management hygiene.
You can implement your own type checking using a struct like this. But you could also just use a language with proper type checking. :)
typedef struct Var {
    enum Type { TYPE_INT, TYPE_PTR, TYPE_FLOAT /* ... */ } type;
    union {
        int Int;
        void *Ptr;
        float Float;
        /* ... */
    } data;
} Var;
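As a purely illustrative example of how a callee might use such a tagged value (the wrapper name LCS_checked is hypothetical):

/* Hypothetical wrapper: refuse anything that is not a pointer instead of
 * blindly dereferencing whatever was passed in. */
char **LCS_checked(Var a, Var b)
{
    if (a.type != TYPE_PTR || b.type != TYPE_PTR)
    {
        return NULL; /* caller passed the wrong kind of value */
    }
    return LCS((char *)a.data.Ptr, (char *)b.data.Ptr);
}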
The ideology of C revolves around the principle that 'The programmer knows what (s)he is doing.' Half the reason as to why C is so lightweight and speedy, is because it doesn't perform such type checks.
If you really need to perform such checks, you might be better off in C++, using references (which are assured to be non-null) instead of pointers.
First, as everyone else has said:
Check for NULL parameters
Everything else is heuristics and careful programming
You can provide a function prototype to your callers and turn the warnings up to 11 (at least -Werror -Wall and -Wextra for gcc). This will cause a compilation error if a parameter of an improper type is passed in. It doesn't help if the caller first casts their parameters to char* (e.g. LCS((char*)1, (char*)1)).
You could call strlen on your arguments, but if the values are non-NULL but still illegal values, the strlen could crash.
You could attempt to see if the pointers are in valid segments for the program. This is not portable, and still not foolproof.
So, to summarize, check for NULL, and turn the warnings to 11. This is what's done in practice.
You can't really do this. First, if a programmer passes in arbitrary integers cast as pointers, they may actually be valid pointers within your address space; they might even point to null-terminated character arrays (in fact, if they are within your address space they effectively will, because the data there will be treated as characters and at some point there will be a 0 byte).
You can test for several invalid (for applications, anyway) pointer values, including NULL and maybe even any value that would point to the first page of the process's address space (it is usually not mapped and can safely be assumed not to be valid). On some systems there are other pages that are never mapped (like the last page). Some systems also have ways to ask about the memory map of a process (/proc/self/maps under Linux) which you could (with a lot of trouble) inspect to see whether the pointer is within a mapped area with the appropriate access.
If you are using a *nix system, one thing you could do is register a signal handler for SIGSEGV, which gets raised when your program tries to access memory that it shouldn't be accessing. Then you could catch that and, with some work, figure out what has happened. Another thing you could do is call a system call that takes a pointer, pass it the pointers you were given as arguments, and see whether it fails (with errno == EFAULT). This is probably not a good idea, since system calls do things besides just testing memory for read and/or write permissions. You could always write the first byte pointed to by a pointer to /dev/null or /dev/zero (using the write system call, not stdio functions) to determine whether you have read permission, and read a byte from /dev/zero or /dev/random into the first byte pointed to (using the read system call, not stdio functions) to test write permission, but if the data at that location is important you would have overwritten a byte of it. If you tried to save a copy of that data into a local variable so that you could restore it after the test, you might cause an error just by reading it within your program. You could get elaborate and write it out and then read it back in to test both access rights, but this is getting complicated.
Your best bet is to just rely on the user of your function to do the right thing.

Disabling NUL-termination of strings in GCC

Is it possible to globally disable NUL-terminated strings in GCC?
I am using my own string library, and I have absolutely no need for the final NUL characters as it already stores the proper length internally in a struct.
However, if I wanted to append 10 strings, this would mean that 10 bytes are unnecessarily allocated on the stack. With wide strings it is even worse: on x86 that is 40 wasted bytes, and on x86_64, 80!
I defined a macro to add those stack-allocated strings to my struct:
#define AppendString(ppDest, pSource) \
AppendSubString(ppDest, (*ppDest)->len + 1, pSource, 0, sizeof(pSource) - 1)
Using sizeof(...) - 1 works quite well, but I am wondering whether I could get rid of the NUL termination in order to save a few bytes?
This is pretty awful, but you can explicitly specify the length of every character array constant:
char my_constant[6] = "foobar";
assert(sizeof my_constant == 6);
wchar_t wide_constant[6] = L"foobar";
assert(sizeof wide_constant == 6*sizeof(wchar_t));
I understand you're only dealing with strings declared in your program:
....
char str1[10];
char str2[12];
....
and not with text buffers you allocate with malloc() and friends otherwise sizeof is not going to help you.
Anyway, I would think twice about removing the \0 at the end: you would lose compatibility with the C standard library functions.
Unless you are going to rewrite every single string function for your library (sprintf, for example), are you sure you want to do it?
I can't remember the details, but when I do
char my_constant[5]
it is possible that it will reserve 8 bytes anyway, because some machines can't address the middle of a word.
It's nearly always best to leave this sort of thing to the compiler and let it handle the optimisation for you, unless there is a really, really good reason to do otherwise.
If you're not using any of the Standard Library function that deal with strings you can forget about the NUL terminating byte.
No strlen(), no fgets(), no atoi(), no strtoul(), no fopen(), no printf() with the %s conversion specifier ...
Declare your "not quite C strings" with just the needed space;
struct NotQuiteCString { /* ... */ };
struct NotQuiteCString variable;
variable.data = malloc(5);
data[0] = 'H'; /* ... */ data[4] = 'o'; /* "hello" */
Indeed this is only in case you are really low in memory. Otherwise I don't recommend to do so.
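For illustration, one possible definition of such a type (the field names here are arbitrary):

#include <stddef.h>

struct NotQuiteCString
{
    size_t len;   /* number of characters actually stored     */
    char  *data;  /* 'len' bytes of text, NOT null-terminated */
};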
It seems the most proper way to do what you are talking about is:
To prepare some minimal 'listing' file in a form like:
string1_constant_name "str1"
string2_constant_name "str2"
...
To construct a utility which processes your file and generates declarations such as
const char string1_constant[4] = "str1";
Of course I'd not recommend doing this by hand, because otherwise you can get into trouble after any string change.
So now you have both non-terminated strings (thanks to the fixed, auto-generated arrays) and a known sizeof() for every variable. This solution seems acceptable.
Benefits are easy localization, the possibility to add some level of checking to lower the risk of this solution, and read-only data segment savings.
The drawback is the need to include all such string constants in every module (as an include, to keep sizeof() known). So this only makes sense if your linker merges such symbols (some don't).
Aren't these similar to Pascal-style strings, or Hollerith strings? I think this is only useful if you actually want the string data to preserve NULs, in which case you're really pushing around arbitrary memory, not "strings" per se.
The question uses false assumptions - it assumes that storing the length (e.g. implicitly by passing it as a number to a function) incurs no overhead, but that's not true.
While one might save space by not storing the 0 byte (or wchar), the size must be stored somewhere, and the example hints that it is passed as a constant argument to a function somewhere, which almost certainly takes more space in code. If the same string is used multiple times, the overhead is per use, not per string.
Having a wrapper that uses strlen to determine the length of a string and isn't inlined will almost certainly save more space.
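A sketch of such a wrapper, reusing the AppendSubString() interface assumed by the question's macro; the struct type name MyString and the wrapper name are hypothetical:

/* Non-inline wrapper: the length is computed once at run time with strlen()
 * instead of being encoded as a sizeof-derived constant at every call site. */
void AppendCString(struct MyString **ppDest, const char *pSource)
{
    AppendSubString(ppDest, (*ppDest)->len + 1, pSource, 0, strlen(pSource));
}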
