Disadvantages of strlen() in Embedded Domain - c

What are the disadvantages of using strlen()?
In TCP communication, a NULL character sometimes appears inside a string, and strlen() then gives the length only up to that NULL character.
We can't find the actual length of the string.
Any alternative to strlen() that we write ourselves also stops at the NULL character. So which method can I use to find the length of a string in C?

To read from a "TCP Communication" you are probably using read. The prototype for read is
ssize_t read(int fildes, void *buf, size_t nbyte);
and the return value is the number of bytes read (bytes whose value is 0 are counted too).
So, let's say you're about to read 10 bytes, all of which are 0. You have an array more than big enough to hold all the data:
int fildes;
char data[1000];
ssize_t nbytes;
// fildes = TCPConnection
nbytes = read(fildes, data, sizeof data);
Now, by inspecting nbytes you know you have read 10 bytes. If you check data[0] through data[9] you will find that they are all 0.
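As a minimal sketch of that idea, the helper below treats the received data as a byte buffer whose length is the value returned by read, never the result of strlen. The name count_zero_bytes is made up for illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many of the nbytes received bytes are 0.
 * nbytes is assumed to be the (non-negative) return value of read(). */
size_t count_zero_bytes(const unsigned char *data, size_t nbytes)
{
    assert(data != NULL || nbytes == 0);
    size_t zeros = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if (data[i] == 0)
            zeros++;
    }
    return zeros;
}
```

For the 10 zero bytes in the example above this reports 10, whereas strlen on the same buffer would report 0.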

If the runtime library provides strcpy() and strcat(), then surely it provides strlen().
I suspect you are confusing NULL, an invalid pointer value, with the ASCII code NUL, a zero character value that indicates the end of a string to many C runtime functions.
As such, there is no concern for inserting a NUL value in a string, nor in detecting it.
Response to updated question:
Since you seem to be processing binary data, the string functions are not a good fit—unless you can guarantee there are no NULs in the stream. However, for this reason, most TCP/IP messages use headers with fields containing the number of bytes which follow.
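A hedged sketch of such a header: the layout below (a 4-byte big-endian payload length followed by the payload) and the name message_complete are assumptions chosen for illustration, not any particular protocol. The function only decides whether the bytes accumulated so far contain a whole message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 4u  /* assumed: 4-byte big-endian payload length */

/* Inspects the `have` bytes accumulated from the stream so far.
 * Returns 1 and stores the payload length if a whole message
 * (header + payload) is present, 0 if more data must be read first. */
int message_complete(const unsigned char *buf, size_t have, uint32_t *payload_len)
{
    assert(buf != NULL && payload_len != NULL);
    if (have < HEADER_SIZE)
        return 0;
    uint32_t len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                   ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    if (have - HEADER_SIZE < len)
        return 0;
    *payload_len = len;
    return 1;
}
```

Because the length travels in the header, the payload may contain any number of 0 bytes without confusing the receiver.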

"Embedded" strikes me as a red herring here.
If you're processing binary data where an embedded NUL might be valid, then you can't expect meaningful results from strlen.
If you're processing strings (as that term is defined in C -- a block of non-NUL data terminated by a NUL) then you can use strlen just fine.
A system being "embedded" would affect this only to the degree that it might be less common to process strings and more common to process binary data.

It is safer to use strnlen instead of strlen to avoid the problems with strlen. Those problems are present everywhere, not just in embedded code. Many of the string functions are dangerous because they keep going until a zero is hit, however far away that is, or, like scanf or strtok, until a pattern is hit.
Remember that TCP is a stream, not a packet protocol: you may have to wait for several or many packets and piece the data together before you can attempt to call it a string. That also assumes the payload is an ASCIIZ string in the first place; if it is raw data, don't use string functions, use some other solution.
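strnlen is a POSIX function and may be missing from a small embedded C library. As a sketch under that assumption, an equivalent can be built on the standard memchr; the name bounded_strlen is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Behaves like POSIX strnlen(): the length of s, but never examines
 * more than maxlen bytes. Returns maxlen if no terminator was found. */
size_t bounded_strlen(const char *s, size_t maxlen)
{
    assert(s != NULL);
    const char *end = memchr(s, '\0', maxlen);
    return end ? (size_t)(end - s) : maxlen;
}
```

A result equal to maxlen tells the caller that the buffer holds no terminated string, which plain strlen cannot report.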

Yes, strlen() uses a terminating character \0, aka NUL. Most str* functions do so. There is a risk that data coming from files, the command line or sockets does not contain this character (usually it won't: it will be \n-terminated), but the size is also reported by the read()/recv() function you've used. If that's a concern, you can always use a buffer slightly larger than the size you pass to those functions, e.g.
char mybuf[256+4];
size_t reallen = fread(mybuf, 1, 256, stdin);
mybuf[reallen] = 0;
// we've got a 0-terminated string in mybuf.
If your data must not contain \0, compare strlen(mybuf) with reallen and terminate the session with an error code if they differ.
If your data may contain 0, then it should be processed as a buffer and not as a string. Size must be kept aside, and memcpy / memcmp functions should be used instead of strcpy and strcmp.
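A minimal sketch of keeping the size alongside the data follows. The byte_buf type, its 64-byte capacity and both function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A received block of bytes together with its length (hypothetical type). */
struct byte_buf {
    unsigned char data[64];
    size_t        len;
};

/* Copies n bytes into the buffer; returns 0 on success, -1 if too large. */
int byte_buf_set(struct byte_buf *b, const void *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));
    if (n > sizeof b->data)
        return -1;
    if (n > 0)
        memcpy(b->data, src, n);
    b->len = n;
    return 0;
}

/* Equality over the full length, embedded 0 bytes included. */
int byte_buf_equal(const struct byte_buf *a, const struct byte_buf *b)
{
    assert(a != NULL && b != NULL);
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

Two buffers that differ only after an embedded 0 compare as different here, while strcmp would call them equal.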
Also, your network protocol should be very explicit about whether strings or binary data are expected in the different parts of the communication. HTTP is, for instance, and it provides many ways to tell the actual size of the transmitted payload.
This isn't specific to "embedded" programs; it has become a major concern in every program, to ensure that no remote code/script injection can occur. If by "embedded" you mean you're in a non-preemptive environment and have only limited time available to perform some action, then yes, you don't want to end up scanning 2GB of incoming bits for a (never-appearing) \0. Either the above trick or strnlen (mentioned in another answer) can be used to ensure this isn't the case.

Related

Can a C implementation use length-prefixed-strings "under the hood"?

After reading this question: What are the problems of a zero-terminated string that length-prefixed strings overcome? I started to wonder, what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap and using them as a "string prefix" to store the number N of its elements?
Then, if the N-th character is '\0', N - 1 would signify the string length.
I believe this could mightily boost performance of functions such as strlen or strcat.
This could potentially turn to extra memory consumption if a program uses non-0-terminated char arrays extensively, but that could be remedied by a compiler flag turning on or off the regular "count-until-you-reach-'\0'" routine for the compiled code.
What are possible obstacles for such an implementation? Does the C Standard allow for this? What problems can this technique cause that I haven't accounted for?
And... has this actually ever been done?
You can store the length of the allocation. And malloc implementations really do do that (or some do, at least).
You can't reasonably store the length of whatever string is stored in the allocation, though, because the user can change the contents to their whim; it would be unreasonable to keep the length up to date. Furthermore, users might start strings somewhere in the middle of the character array, or might not even be using the array to hold a string!
Then, if the N-th character is '\0', N - 1 would signify the string length.
Actually, no, and that's why this suggestion cannot work.
If I overwrite a character in a string with a 0, I have effectively truncated the string, and a subsequent call of strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the strtok standard library function. Amongst others.)
Moreover, it is entirely legal to call strlen on an interior byte of the string.
For example (just for demonstration purposes, although I'll bet you can find code almost identical to this in common use.)
#include <string.h>

/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer).
 * On success, returns the value part and modifies the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
  char* delim = strchr(keyval, '=');
  if (delim) {
    if (plen) *plen = (int)strlen(delim + 1);
    *delim = 0;
    return delim + 1;
  } else {
    return NULL;
  }
}
There's nothing fundamentally stopping you from doing this in your application, if that was useful (one of the comments noted this). There are two problems that would emerge, however:
1. You'd have to reimplement all the string-handling functions, and have my_strlen, my_strcpy, and so on, and add string-creating functions. That might be annoying, but it's a bounded problem.
2. You'd have to stop people, when writing for the system, deliberately or automatically treating the associated character arrays as ‘ordinary’ C strings, and using the usual functions on them. You might have to make sure that such usages broke promptly.
This means that it would, I think, be infeasible to smuggle a reimplemented ‘C string’ into an existing program.
Something like
typedef struct {
size_t len;
char* buf;
} String;
size_t my_strlen(String*);
...
might work, since type-checking would frustrate (2) (unless someone decided to hack things ‘for efficiency’, in which case there's not much you can do).
Of course, you wouldn't do this unless and until you'd proven that string management was the bottleneck in your code and that this approach provably improved things....
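To make the idea concrete, here is a rough sketch of what such a type could look like. Only the struct layout and the name my_strlen come from the text above; string_from and my_strcat are hypothetical helpers, and keeping buf 0-terminated is an extra assumption made so that it can still be passed to ordinary C functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t len;   /* number of characters, terminator not counted */
    char  *buf;   /* heap storage, kept 0-terminated for interop */
} String;

/* Builds a String from a C string; buf is NULL if allocation fails. */
String string_from(const char *cstr)
{
    assert(cstr != NULL);
    String s = { 0, NULL };
    size_t n = strlen(cstr);
    s.buf = malloc(n + 1);
    if (s.buf) {
        memcpy(s.buf, cstr, n + 1);
        s.len = n;
    }
    return s;
}

/* O(1): the length is read, not searched for. */
size_t my_strlen(const String *s)
{
    return s->len;
}

/* Appends b to a (two distinct objects). Returns 0, or -1 if out of memory. */
int my_strcat(String *a, const String *b)
{
    assert(a != b && a->buf != NULL && b->buf != NULL);
    char *p = realloc(a->buf, a->len + b->len + 1);
    if (!p)
        return -1;
    memcpy(p + a->len, b->buf, b->len + 1);  /* copies b's terminator too */
    a->buf = p;
    a->len += b->len;
    return 0;
}
```

Concatenation no longer scans the destination for its end, which is where the hoped-for speedup would come from; whether it matters in practice still has to be measured.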
There are a couple of issues with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you only reserve 1 byte for length, then your string can only go up to 255 characters. You can certainly use more bytes to store the length, but how many? 2? 4?
What if you try to concatenate two strings that are both at the edge of their size limits (i.e., if you use 1 byte for length and try to concatenate two 250-character strings to each other, what happens)? Do you simply add more bytes to the length as necessary?
Secondly, where do you store this metadata? It somehow has to be associated with the string. This is similar to the problem Dennis Ritchie ran into when he was implementing arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but as he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions get converted to pointer expressions in most circumstances.
You could create a new aggregate type like
struct string
{
char *data;
size_t len;
};
but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation would still have to support the existing interface.
You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish length bytes from content bytes, and you can't read the first character by simply dereferencing the pointer. Functions like strcat would have to know how to step around the length bytes, how to adjust the contents if the number of length bytes changes, etc.
The 0-terminated approach has its disadvantages, but it's also a helluva lot easier to implement and makes manipulating strings a lot easier.
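The size-limit problem can be illustrated with a small sketch of a one-byte length prefix. The layout and the names pstr_set and pstr_cat are assumptions for this example; each buffer is assumed to be 256 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PSTR_MAX 255u   /* largest length a single prefix byte can hold */

/* Layout: p[0] = length, p[1..length] = characters, no terminator.
 * Every buffer passed in is assumed to be 1 + PSTR_MAX bytes long. */
void pstr_set(unsigned char *p, const char *cstr)
{
    size_t n = strlen(cstr);
    assert(n <= PSTR_MAX);
    p[0] = (unsigned char)n;
    memcpy(p + 1, cstr, n);
}

/* Appends src to dst. Returns 0 on success, or -1 (dst unchanged) when
 * the combined length no longer fits in the prefix byte. */
int pstr_cat(unsigned char *dst, const unsigned char *src)
{
    size_t total = (size_t)dst[0] + src[0];
    if (total > PSTR_MAX)
        return -1;
    memcpy(dst + 1 + dst[0], src + 1, src[0]);
    dst[0] = (unsigned char)total;
    return 0;
}
```

Concatenating two 250-character strings fails here, which is exactly the case raised above; any fix means widening the prefix and changing the layout.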
The string methods in the standard library have defined semantics. If one generates an array of char that contains various values, and passes a pointer to the array or a portion thereof, the methods whose behavior is defined in terms of NUL bytes must search for NUL bytes in the same fashion as defined by the standard.
One could define one's own methods for string handling which use a better form of string storage, and simply pretend that the standard library string-related functions don't exist unless one must pass strings to things like fopen. The biggest difficulty with such an approach is that unless one uses non-portable compiler features it would not be possible to use in-line string literals. Instead of saying:
ns_output(my_file, "This is a test"); // ns -- new string
one would have to say something more like:
MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);
where the macro MAKE_NEW_STRING would create a union of an anonymous type, define an instance called this_is_a_test, and suitably initialize it. Since a lot of strings would be of different anonymous types, type-checking would require that strings be unions that include a member of a known array type, and code expecting strings should be given a pointer to that member, likely using something like:
#define ns_output(f,s) (ns_output_func((f),(s).stringref))
It would be possible to define the types in such a way as to avoid the need for the stringref member and have code just accept void*, but the stringref member would essentially perform static duck-typing (only things with a stringref member could be given to such a macro) and could also allow type-checking on the type of stringref itself.
If one could accept those constraints, I think one could probably write code that was more efficient in almost every way than zero-terminated strings; the question would be whether the advantages would be worth the hassle.

'strncpy' vs. 'sprintf'

I can see many sprintf's used in my applications for copying a string.
I have a character array:
char myarray[10];
const char *str = "mystring";
Now if I want to copy the string str into myarray, is it better to use:
sprintf(myarray, "%s", str);
or
strncpy(myarray, str, 8);
?
Neither should be used, at all.
sprintf is dangerous, deprecated, and superseded by snprintf. The only way to use the old sprintf safely with string inputs is to either measure their length before calling sprintf, which is ugly and error-prone, or by adding a field precision specifier (e.g. %.8s or %.*s with an extra integer argument for the size limit). This is also ugly and error-prone, especially if more than one %s specifier is involved.
strncpy is also dangerous. It is not a buffer-size-limited version of strcpy. It's a function for copying characters into a fixed-length, null-padded (as opposed to null-terminated) array, where the source may be either a C string or a fixed-length character array at least the size of the destination. Its intended use was for legacy unix directory tables, database entries, etc. that worked with fixed-size text fields and did not want to waste even a single byte on disk or in memory for null termination. It can be misused as a buffer-size-limited strcpy, but doing so is harmful for two reasons. First of all, it fails to null terminate if the whole buffer is used for string data (i.e. if the source string length is at least as long as the dest buffer). You can add the termination back yourself, but this is ugly and error-prone. And second, strncpy always pads the full destination buffer with null bytes when the source string is shorter than the output buffer. This is simply a waste of time.
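Both behaviours are easy to observe. The sketch below wraps strncpy in a hypothetical fill_field helper, which uses it for its intended purpose (a fixed-width, null-padded field), and adds a small check for whether a field happens to contain a terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fills a fixed-width, null-padded field: the job strncpy was built for.
 * The result is NOT a C string when src has width or more characters. */
void fill_field(char *field, size_t width, const char *src)
{
    assert(field != NULL && src != NULL);
    strncpy(field, src, width);
}

/* Returns 1 if the first width bytes of field contain a 0 terminator. */
int field_is_terminated(const char *field, size_t width)
{
    return memchr(field, '\0', width) != NULL;
}
```

Filling a 4-byte field from "abcdefgh" leaves no terminator at all, and filling an 8-byte field from "a" writes seven padding zeros after the single character.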
So what should you use instead?
Some people like the BSD strlcpy function. Semantically, it's identical to snprintf(dest, destsize, "%s", source) except that the return value is size_t and it does not impose an artificial INT_MAX limit on string length. However, most popular non-BSD systems lack strlcpy, and it's easy to make dangerous errors writing your own, so if you want to use it, you should obtain a safe, known-working version from a trustworthy source.
My preference is to simply use snprintf for any nontrivial string construction, and strlen+memcpy for some trivial cases that have been measured to be performance-critical. If you get in a habit of using this idiom correctly, it becomes almost impossible to accidentally write code with string-related vulnerabilities.
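A sketch of both idioms, with hypothetical helper names (copy_string and copy_string_fast); treating truncation as an error is a design choice of this example, not something the answer prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* snprintf idiom: dst (capacity dstsize > 0) is always 0-terminated.
 * Returns 0 if everything fit, -1 on truncation or output error. */
int copy_string(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL && dstsize > 0);
    int needed = snprintf(dst, dstsize, "%s", src);
    if (needed < 0) {
        dst[0] = '\0';
        return -1;
    }
    return (size_t)needed < dstsize ? 0 : -1;
}

/* strlen + memcpy idiom: refuses to copy rather than truncating. */
int copy_string_fast(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t n = strlen(src);
    if (n >= dstsize)
        return -1;
    memcpy(dst, src, n + 1);   /* n characters plus the terminator */
    return 0;
}
```

With the char myarray[10] from the question, copy_string(myarray, sizeof myarray, str) copies "mystring" completely and returns 0.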
The different versions of printf/scanf are incredibly slow functions, for the following reasons:
They use variable argument lists, which makes parameter passing more complex. This is done through various obscure macros and pointers. All the arguments have to be parsed at runtime to determine their types, which adds extra overhead code. (VA lists are also quite a redundant feature of the language, and dangerous as well, as they have far weaker typing than plain parameter passing.)
They must handle a lot of complex formatting and all the different types supported. This adds plenty of overhead to the function as well. Since all type evaluations are done at runtime, the compiler cannot optimize away parts of the function that are never used. So if you only wanted to print integers with printf(), you will still get support for float numbers, complex arithmetic, string handling and so on linked into your program, as a complete waste of space.
Functions like strcpy() and particularly memcpy(), on the other hand, are heavily optimized by the compiler, and are often implemented in inline assembler for maximum performance.
Some measurements I once made on barebone 16-bit low-end microcontrollers are included below.
As a rule of thumb, you should never use stdio.h in any form of production code. It is to be considered as a debugging/testing library. MISRA-C:2004 bans stdio.h in production code.
EDIT
Replaced subjective numbers with facts:
Measurements of strcpy versus sprintf on target Freescale HCS12, compiler Freescale Codewarrior 5.1. Using the C90 implementation of sprintf; C99 would be even less efficient. All optimizations enabled. The following code was tested:
const char str[] = "Hello, world";
char buf[100];
strcpy(buf, str);
sprintf(buf, "%s", str);
Execution time, including parameter shuffling on/off call stack:
strcpy 43 instructions
sprintf 467 instructions
Program/ROM space allocated:
strcpy 56 bytes
sprintf 1488 bytes
RAM/stack space allocated:
strcpy 0 bytes
sprintf 15 bytes
Number of internal function calls:
strcpy 0
sprintf 9
Function call stack depth:
strcpy 0 (inlined)
sprintf 3
I would not use sprintf just to copy a string. It's overkill, and someone who reads that code would certainly stop and wonder why I did that, and if they (or I) are missing something.
There is one way to use sprintf() (or if being paranoid, snprintf() ) to do a "safe" string copy, that truncates instead of overflowing the field or leaving it un-NUL-terminated.
That is to use the "*" format character as the "string precision", as follows:
char dest_buff[32];
....
sprintf(dest_buff, "%.*s", (int)(sizeof(dest_buff) - 1), unknown_string);
This places the contents of unknown_string into dest_buff allowing space for the terminating NUL.

Serializing strings in C

I'm serializing structs into byte-streams. My method is simple:
pack all ints in little endian order and copy strings including the null terminator. The other side has to statically know how to unpack the byte-stream, there is no additional metadata.
My problem is that I do not know how to handle the NULL pointer.
I need to send something, because there is no additional metadata in the stream.
I considered the following options:
Send a '\0' and make the receiving side interpret it as NULL in any case
Send a '\0' and make the receiving side interpret it as '\0' in any case (alloc a byte)
Send a special character representing char* str == NULL, e.g. ETX, EOT, EM ?
What do you think?
It looks like you are currently trying to tell the receiving end that the end of the serialized string has been reached by passing it a special character. There are a million cases that can screw you over with this:
What if your struct contains a byte that is equal to that special character. Escape it with another special character. What if your struct contains a byte sequence that is equal to your escape character followed by your special character, check for that too?
Yeah it's doable, but I think that's not a very good solution and you'll have to write a parser to look for the escape character and then anyone who takes a look at the code later will spend two hours trying to figure out what's going on.
(tl;dr) Instead... just make the first 32 bits of the serialized string equal to the number of bytes in the string. This only costs 4 bytes per serialization, solves all your problems, you won't have to write a parser or worry about special characters, and will make it a lot easier on the next guy who gets to read through your code!
edit
Thanks to JeremyP I've just realized that I didn't really answer your question. Send one of these guys for every string:
struct s_str
{
bool is_null;
int size;
char* str;
};
If it's null, simply set is_null to true and you don't really have to worry about the other two.
If it's size zero, set is_null to false and size to zero.
If str contains just a '\0', set is_null to false, size to one, and str[0] to '\0'
In my opinion, this might not be the most memory efficient way (you could probably save a byte somewhere somehow) but is definitely quite clear in what you're doing, and again the next guy that comes along will like this a lot more.
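One possible wire encoding of that struct is sketched below. The layout (one flag byte, then a 4-byte little-endian length and the characters without terminator) and the name put_string are assumptions for illustration; the little-endian order follows the question's convention for ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes 1 flag byte (1 = NULL pointer, 0 = string present). For a
 * present string it then writes a 4-byte little-endian length followed
 * by the characters, without the terminator.
 * Returns the number of bytes written, or 0 if out has too little room. */
size_t put_string(unsigned char *out, size_t cap, const char *str)
{
    assert(out != NULL);
    if (str == NULL) {
        if (cap < 1)
            return 0;
        out[0] = 1;
        return 1;
    }
    size_t n = strlen(str);
    if (n > UINT32_MAX || cap < 5 || cap - 5 < n)
        return 0;
    out[0] = 0;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (unsigned char)((n >> (8 * i)) & 0xFF);
    memcpy(out + 5, str, n);
    return 5 + n;
}
```

A NULL pointer costs one byte and an empty string five, so the receiver can tell the two apart without any special character in the stream.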
Do not do this. Use some extra bytes to store the length and concatenate them with your data string. The receiving end can check the length to know how much it should read into its local buffer.
It depends on the significance of the pointer in your protocol.
If the pointer is significant, i.e. it is needed for the recipient to know how to rebuild the struct, then you need to send something. It could be either a byte with 0/non-zero to indicate existence, or an integer that indicates the number of bytes pointed to by the pointer.
Example:
struct Foo {
    int *arr;
    char *text;
};
Struct Foo could be serialized like this:
<arr length>< arr ><text length>< text >
4 bytes n bytes 4 bytes n bytes
I would advise you to use an existing library for serialization. I can think of two at the moment: tpl and gwlib's gwser.
About tpl:
You can use tpl to store and reload your C data quickly and easily.
Tpl works with files, memory buffers and file descriptors so it's
suitable for use as a file format, IPC message format or any scenario
where you need to store and retrieve your data.
About gwlib, see the link, it's not very verbose, but it provides a few usage examples.

Getting the length of a formatted string from wsprintf

When using standard char* strings, the snprintf and vsnprintf functions will return the length of the output string, even if that string was truncated due to overflow.* It seems like the ISO C committee didn't like this functionality when they added swprintf and vswprintf, which return -1 on overflow.
Does anyone know of a function that will provide this length? I don't know the size of the potential strings. I might be asking too much, but.. I'd rather not:
allocate a huge static temp buffer
iteratively allocate and free memory until I've found a size that fits
add an additional library dependency
write my own format string parser
*I realize MSVC doesn't do this, and instead provides the scprintf and vscprintf functions, but I'm looking for other compilers, mainly GCC.
My best suggestion to you would be not to use wchar_t strings at all, especially if you're not writing Windows-oriented code. In case that's not an option, here are some other ideas:
If your format string does not contain non-ASCII characters itself, what about first calling vsnprintf with the same set of arguments to get the length in bytes, then using that as a safe upper bound for the length in wchar_t characters? (If there are few or no non-ASCII characters, the bound will be tight.)
If you're okay with introducing a dependency on a POSIX function (which is likely to be added to C1x), use open_wmemstream and fwprintf.
Just iterate allocating larger buffers, but do it smart: increase the size geometrically at each step, e.g. 127, 255, 511, 1023, 2047, ... I like this pattern better than whole powers of 2 because it's easy to avoid dangerous case where allocation might succeed for SIZE_MAX/2+1 but then wrap to 0 at the next iteration.
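The last suggestion might look roughly like the sketch below. The function name format_wide and the max_chars cap are assumptions; the cap is there because vswprintf also returns a negative value for encoding errors, which no buffer size can fix.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Formats into a freshly allocated wide string, growing the buffer
 * geometrically (127, 255, 511, ...) until vswprintf succeeds.
 * Returns NULL on allocation failure or once the next size would
 * exceed max_chars. The caller frees the result. */
wchar_t *format_wide(size_t max_chars, const wchar_t *fmt, ...)
{
    assert(fmt != NULL);
    wchar_t *buf = NULL;
    for (size_t n = 127; n <= max_chars; n = 2 * n + 1) {
        if (n > SIZE_MAX / sizeof *buf)
            break;                      /* byte count would overflow */
        wchar_t *bigger = realloc(buf, n * sizeof *buf);
        if (!bigger)
            break;
        buf = bigger;

        va_list ap;
        va_start(ap, fmt);              /* restart the argument list per attempt */
        int written = vswprintf(buf, n, fmt, ap);
        va_end(ap);
        if (written >= 0)
            return buf;
    }
    free(buf);
    return NULL;
}
```

A call such as format_wide(4096, L"%d-%ls", 42, L"abc") returns an allocated wide string on success, and NULL if the output would need more than the given cap.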
This returns the buffer size for wide character strings:
vswprintf(nullptr, -1, aFormat, argPtr);

What makes a C standard library function dangerous, and what is the alternative?

While learning C I regularly come across resources which recommend that some functions (e.g. gets()) are never to be used, because they are either difficult or impossible to use safely.
If the C standard library contains a number of these "never-use" functions, it would seem necessary to learn a list of them, what makes them unsafe, and what to do instead.
So far, I've learned that functions which:
Cannot be prevented from overwriting memory
Are not guaranteed to null-terminate a string
Maintain internal state between calls
are commonly regarded as being unsafe to use. Is there a list of functions which exhibit these behaviours? Are there other types of functions which are impossible to use safely?
In the old days, most of the string functions had no bounds checking. Of course they couldn't just delete the old functions, or modify their signatures to include an upper bound, that would break compatibility. Now, for almost every one of those functions, there is an alternative "n" version. For example:
strcpy -> strncpy
strlen -> strnlen
strcmp -> strncmp
strcat -> strncat
strdup -> strndup
sprintf -> snprintf
wcscpy -> wcsncpy
wcslen -> wcsnlen
And more.
See also https://github.com/leafsr/gcc-poison which is a project to create a header file that causes gcc to report an error if you use an unsafe function.
Yes, fgets(..., ..., stdin) is a good alternative to gets(), because it takes a size parameter (gets() has in fact been removed from the C standard entirely in C11). Note that fgets() is not exactly a drop-in replacement for gets(), because the former will include the terminating \n character if there was room in the buffer for a complete line to be read.
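A minimal sketch of a gets() replacement built on fgets() follows; the name read_line and its return convention are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads one line into buf (capacity size) and strips the trailing '\n'
 * that fgets() keeps. Returns 1 for a complete line, 0 if the line was
 * cut off because it did not fit (or input ended without '\n'), and
 * -1 on end of file or read error. */
int read_line(FILE *in, char *buf, size_t size)
{
    assert(in != NULL && buf != NULL && size > 1);
    if (fgets(buf, (int)size, in) == NULL)
        return -1;
    size_t len = strcspn(buf, "\n");   /* index of '\n' or of the terminator */
    int complete = (buf[len] == '\n');
    buf[len] = '\0';
    return complete;
}
```

When 0 is returned the rest of the long line is still waiting in the stream, so the caller decides whether to keep reading or to reject the input.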
scanf() is considered problematic in some cases, rather than straight-out "bad", because if the input doesn't conform to the expected format it can be impossible to recover sensibly (it doesn't let you rewind the input and try again). If you can just give up on badly formatted input, it's useable. A "better" alternative here is to use an input function like fgets() or fgetc() to read chunks of input, then scan it with sscanf() or parse it with string handling functions like strchr() and strtol(). Also see below for a specific problem with the "%s" conversion specifier in scanf().
It's not a standard C function, but the BSD and POSIX function mktemp() is generally impossible to use safely, because there is always a TOCTTOU race condition between testing for the existence of the file and subsequently creating it. mkstemp() or tmpfile() are good replacements.
strncpy() is a slightly tricky function, because it doesn't null-terminate the destination if there was no room for it. Despite the apparently generic name, this function was designed for creating a specific style of string that differs from ordinary C strings - strings stored in a known fixed width field where the null terminator is not required if the string fills the field exactly (original UNIX directory entries were of this style). If you don't have such a situation, you probably should avoid this function.
atoi() can be a bad choice in some situations, because you can't tell when there was an error doing the conversion (e.g., if the number exceeded the range of an int). Use strtol() if this matters to you.
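A sketch of the strtol() approach, with a hypothetical wrapper named parse_int; rejecting trailing characters is a choice made for this example and may be too strict for some inputs.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Converts s to an int. Returns 0 on success, or -1 if s has no digits,
 * has trailing characters, or the value does not fit in an int. */
int parse_int(const char *s, int *out)
{
    assert(s != NULL && out != NULL);
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')
        return -1;                      /* nothing parsed, or junk after it */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;                      /* out of range for long or int */
    *out = (int)v;
    return 0;
}
```

Unlike atoi(), this distinguishes a genuine 0 from a failed conversion and reports overflow instead of leaving the result undefined.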
strcpy(), strcat() and sprintf() suffer from a similar problem to gets() - they don't allow you to specify the size of the destination buffer. It's still possible, at least in theory, to use them safely - but you are much better off using strncat() and snprintf() instead (you could use strncpy(), but see above). Do note that whereas the n for snprintf() is the size of the destination buffer, the n for strncat() is the maximum number of characters to append and does not include the null terminator. Another alternative, if you have already calculated the relevant string and buffer sizes, is memmove() or memcpy().
On the same theme, if you use the scanf() family of functions, don't use a plain "%s" - specify the size of the destination e.g. "%200s".
strtok() is generally considered to be evil because it stores state information between calls. Don't try running THAT in a multithreaded environment!
Strictly speaking, there is one really dangerous function. It is gets() because its input is not under the control of the programmer. All other functions mentioned here are safe in and of themselves. "Good" and "bad" boils down to defensive programming, namely preconditions, postconditions and boilerplate code.
Let's take strcpy() for example. It has some preconditions that the programmer must fulfill before calling the function. Both strings must be valid, non-NULL pointers to zero terminated strings, and the destination must provide enough space with a final string length inside the range of size_t. Additionally, the strings are not allowed to overlap.
That is quite a lot of preconditions, and none of them is checked by strcpy(). The programmer must be sure they are fulfilled, or he must explicitly test them with additional boilerplate code before calling strcpy():
n = DST_BUFFER_SIZE;
if ((dst != NULL) && (src != NULL) && (strlen(dst)+strlen(src)+1 <= n))
{
strcpy(dst, src);
}
Already silently assuming the non-overlap and zero-terminated strings.
strncpy() does include some of these checks, but it adds another postcondition the programmer must take care for after calling the function, because the result may not be zero-terminated.
strncpy(dst, src, n);
if (n > 0)
{
dst[n-1] = '\0';
}
Why are these functions considered "bad"? Because each call would need additional boilerplate code to be really safe when the programmer's assumptions about validity turn out to be wrong, and programmers tend to forget this code.
Or even argue against it. Take the printf() family. These functions return a status that indicates error or success. Who checks whether the output to stdout or stderr succeeded? The usual argument is that you can't do anything at all when the standard channels are not working. Well, what about rescuing the user data and terminating the program with an error-indicating exit code, instead of the alternative of crashing and burning later with corrupted user data?
In a time- and money-limited environment it is always the question of how much safety nets you really want and what is the resulting worst case scenario? If it is a buffer overflow as in case of the str-functions, then it makes sense to forbid them and probably provide wrapper functions with the safety nets already within.
One final question about this: What makes you sure that your "good" alternatives are really good?
Any function that does not take a maximum length parameter and instead relies on an end-of-string marker to be present (such as many 'string' handling functions).
Any method that maintains state between calls.
sprintf is bad, does not check size, use snprintf
gmtime, localtime -- use gmtime_r, localtime_r
To add something about strncpy most people here forgot to mention. strncpy can result in performance problems as it clears the buffer to the length given.
char buff[1000];
strncpy(buff, "1", sizeof buff);
will copy 1 char and overwrite 999 bytes with 0
Another reason why I prefer strlcpy (I know strlcpy is a BSDism but it is so easy to implement that there's no excuse to not use it).
View page 7 (PDF page 9) SAFECode Dev Practices
Edit: From the page -
strcpy family
strncpy family
strcat family
scanf family
sprintf family
gets family
strcpy - again!
Most people agree that strcpy is dangerous, but strncpy is only rarely a useful replacement. It is usually important to know when you've needed to truncate a string, and for this reason you usually need to examine the length of the source string anyway. If this is the case, memcpy is usually the better replacement, as you know exactly how many characters you want copied.
e.g. truncation is error:
n = strlen( src );
if( n >= buflen )
return ERROR;
memcpy( dst, src, n + 1 );
truncation allowed, but number of characters must be returned so caller knows:
n = strlen( src );
if( n >= buflen )
n = buflen - 1;
memcpy( dst, src, n );
dst[n] = '\0';
return n;

Resources