strncmp proper usage - c

Here's the quick background: I've got a client and a server program that are communicating with each other over a Unix socket. When parsing the received messages on the server side, I am attempting to use strncmp to figure out what action to take.
The problem I'm having is figuring out exactly what to use for the length argument of strncmp. The reason this is being problematic is that some of my messages share a common prefix. For example, I have a message "getPrimary", which causes the server to respond with a primary server address, and a message "getPrimaryStatus", which causes the server to respond with the status of the primary server. My initial thought was to do the following:
if(strncmp(message,"getPrimary",strlen("getPrimary"))==0){
    return foo;
}
else if(strncmp(message,"getPrimaryStatus",strlen("getPrimaryStatus"))==0){
    return bar;
}
The problem with this is when I send the server "getPrimaryStatus", the code will always return foo because strncmp is not checking far enough in the string. I could pass in strlen(message) as the length argument to strncmp, but this seems to defeat the purpose of using strncmp, which is to prevent overflow in the case of unexpected input. I do have a static variable for the maximum message length I can read, but it seems like passing this in as the length is only making sure that if the message overflows, the effects are minimized.
I've come up with a few solutions, but they aren't very pretty, so I was wondering if there was a common way of dealing with this problem.
For reference, my current solutions are:
Order my if / else if statements in such a way that any messages with common prefixes are checked in order of descending length (which seems like a really good way to throw a landmine in my code for anyone trying to add something to it later on).
Group my messages with common prefixes together and look for the suffix first:
if(strncmp(message,"getPrimary",strlen("getPrimary"))==0){
    if(strncmp(message,"getPrimaryStatus",strlen("getPrimaryStatus"))==0)
        return bar;
    else
        return foo;
}
But this just feels messy, especially since I have about 20 different possible messages that I'm handling.
Create an array of all the possible messages I have, add a function to my init sequence that will order the array by descending length, and have my code search through the elements of that list until it finds a match. This seems complicated and silly.
It seems like this should be a common enough issue that there ought to be a solution for it somewhere, but I haven't been able to find anything so far.
Thanks in advance for the help!

Presuming that the string in message is supposed to be null-terminated, the only reason to use strncmp() here rather than strcmp() would be to prevent it looking beyond the end of message, in the case where message is not null-terminated.
As such, the n you pass to strncmp() should be the received size of message, which you ought to know (from the return value of the read() / recv() function that read the message).
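For instance, a minimal sketch of that idea (sock and MAX_MSG are placeholder names, not from the question): use the length reported by recv() to terminate the buffer, after which a plain strcmp() distinguishes the two commands.
char message[MAX_MSG + 1];
ssize_t n = recv(sock, message, MAX_MSG, 0);
if (n < 0)
    return -1;                 /* hypothetical error handling */
message[n] = '\0';             /* safe: the buffer has one spare byte */

if (strcmp(message, "getPrimary") == 0)
    return foo;
else if (strcmp(message, "getPrimaryStatus") == 0)
    return bar;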

One technique is to compare the longest names first - order the tests (or the table containing the keywords) so that the longer names precede the shorter. However, taking your example:
GetPrimaryStatus
GetPrimary
You probably want to ensure that GetPrimaryIgnition is not recognized as GetPrimary. So you really need to compare using the length of the longer of the two strings - the message or the keyword.
Your data structure here might be:
static const struct
{
    char *name;
    size_t name_len;
    int retval;
} Messages[] =
{
    { "getPrimaryStatus", sizeof("getPrimaryStatus"), CMD_PRIMARYSTATUS },
    { "getPrimary",       sizeof("getPrimary"),       CMD_PRIMARY },
    ...
};
You can then loop through this table to find the relevant command. With some care you can limit the range that you have to look at. Note that the sizeof() values include the NUL at the end of the string. This is useful if you can null terminate your message.
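A lookup loop over that table might look like this sketch (msg_len is assumed to be the number of bytes actually received and CMD_UNKNOWN is a hypothetical "no match" value; accepting only a keyword followed by the end of the message keeps getPrimaryIgnition from being taken for getPrimary):
for (size_t i = 0; i < sizeof(Messages) / sizeof(Messages[0]); i++)
{
    size_t kw_len = Messages[i].name_len - 1;   /* name_len includes the NUL */
    if (msg_len >= kw_len &&
        strncmp(message, Messages[i].name, kw_len) == 0 &&
        (msg_len == kw_len || message[kw_len] == '\0'))
        return Messages[i].retval;
}
return CMD_UNKNOWN;   /* hypothetical: no keyword matched */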
However, it is by far simplest if you can null terminate the command word in the message, either by copying the message somewhere or by modifying the message in situ. You then use strcmp() instead of strncmp(). A Shortest Unique Prefix lookup is harder to code.
One plausible way of finding the command word is with strcspn() - assuming your commands are all alphabetic or alphanumeric.
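For example, a short sketch of that (the delimiter set here is an assumption, and message is assumed to be writable and already NUL-terminated):
size_t cmd_len = strcspn(message, " \t\r\n");   /* length of the command word */
message[cmd_len] = '\0';                        /* terminate the command in situ */

if (strcmp(message, "getPrimaryStatus") == 0)
    return bar;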

I sense that you are using strncmp to prevent buffer overflow; however, the message is already copied into memory (i.e. the message buffer). Also, the prototype
int strncmp ( const char * str1, const char * str2, size_t num );
indicates that the function has no side effects (i.e. it does not change either input buffer) so there should be no risk that it will overwrite the buffer and change memory. (This is not the case for strcpy(). )
You could make sure that the length of your message buffer is longer than your longest command string. That way you are sure that you are always accessing memory that you own.
Also, if you insist on using strncmp you could store your list of commands in an array and sort it from largest to smallest. You could associate each string with a length (and possibly a function pointer to execute a handler).
Finally, you could find a C version of what C++ calls a map or what Ruby or PHP call associative arrays. This lets library handle this if-else tree for you efficiently and correctly.

Don't use strncmp(). Use strlcmp() instead. It's safer.

Does your message contain just one of these commands, or a command string followed by whitespace/open-parenthesis/etc.?
If it's the former, drop strncmp and just use strcmp.
If it's the latter, simply check isspace(message[strlen(command)]) or message[strlen(command)]=='(' or similar. (Note: strlen(command) is a constant and you should probably write it as such, or use a macro to get it from the size of the string literal.)
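A sketch of that check (LIT_LEN is a hypothetical helper macro, and isspace() needs <ctype.h>):
#define LIT_LEN(s) (sizeof(s) - 1)     /* length of a string literal, minus the NUL */

if (strncmp(message, "getPrimary", LIT_LEN("getPrimary")) == 0 &&
    (message[LIT_LEN("getPrimary")] == '\0' ||
     message[LIT_LEN("getPrimary")] == '('  ||
     isspace((unsigned char)message[LIT_LEN("getPrimary")])))
{
    /* the command really is "getPrimary", not e.g. "getPrimaryStatus" */
}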

The only safe way to use strncmp to determine if two strings are equal is to verify beforehand that the strings have the same length:
/* len is a placeholder for whatever variable or function you use to get the length */
if ((len(a) == len(b)) && (strncmp(a, b, len(a)) == 0))
{
/* Strings are equal */
}
Otherwise you will match to something longer or shorter than your comparison:
strncmp(a, "test", strlen("test")) matches "testing", "test and a whole bunch of other characters", ect.
strncmp(a, "test", strlen(a)) matches "", "t", "te", "tes".

Use strcmp, but also compare the lengths of the two strings. If the lengths are identical, then strcmp will give you the result you seek.

Digging this out of my memory from doing C programming a year ago: the third argument tells the function how many characters to process for the comparison. That's why it's safe, as you have control over how many characters are processed.
So should be something like:
if(strncmp(message, "getPrimary", strlen("getPrimary")) == 0) {
//
}

Related

Extracting the domain extension of a URL stored in a string using scanf()

I am writing code that takes a URL address as a string literal as input, then runs the domain extension of the URL through an array and returns the index if it finds a match, or -1 if it does not.
For example, an input would be www.stackoverflow.com, in this case, I'd need to extract only the com part. In case of www.google.com.tr, I'd need only com again, ignoring the .tr part.
I can think of basically writing a function that'll do that just fine but I'm wondering if it is possible to do it using scanf() itself?
Using scanf here is really overkill, but you can do something like this to achieve it:
char a[MAXLEN],b[MAXLEN],c[MAXLEN];
scanf("%[^.].%[^.].%[^. \n]",a,b,c);
printf("Desired part is = %s\n",c);
To be sure that formatting is correct you can check whether this scanf call is successful or not. For example:
if( 3 != scanf("%[^.].%[^.].%[^. \n]",a,b,c)){
fprintf(stderr,"Format must be atleast sth.something.sth\n");
exit(EXIT_FAILURE);
}
The other way of achieving the same thing is to use fgets to read the whole line and then parse it with strtok using "." as the delimiter. This way you will get the individual parts. With fgets you can easily support different kinds of rules. Instead of incorporating everything into scanf (which is difficult to handle in the error case), you can use fgets and strtok to do the same.
With the scanf solution above, only the first three parts of the URL are considered; the rest are not parsed. But that is hardly the practical situation. Most of the time we have to process all the information, every part of the URL (and we don't know in advance how many parts there can be). In that case you are better off using fgets/strtok as mentioned above.
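A sketch of that fgets/strtok variant (MAXLEN and reading from stdin are carried over from the scanf example; it needs <stdio.h> and <string.h>):
char line[MAXLEN];
if (fgets(line, sizeof line, stdin) != NULL)
{
    line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline, if any */
    for (char *part = strtok(line, "."); part != NULL; part = strtok(NULL, "."))
        printf("part: %s\n", part);     /* each "."-separated part of the URL */
}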

Safety of using \0 before the end of character arrays

I am writing a driver for an embedded system that runs a custom, modified version of Linux (it's a handscanner). The manufacturer supplies a custom Eclipse Juno distribution with a few libraries and examples included.
The output I receive from the COM port comes in the form of a standard character array. I am using the individual characters in the array to convey information (error ids and error codes), like this:
if (tmp[i] == 250)
Where tmp is a character array in the form of char tmp[500]; that is first initialized to 0 and then filled with input from the COM port.
My question is:
Assuming I iterate through every piece of the array, is it safe to use 0 (as in \0) at any point before the end of the array? Assuming I am:
Not treating it as a string (iterating through and using it like an int array)
Aware of what is going to be in there and what exactly this random \0 in the middle of it is supposed to mean.
The reason I'm asking is that I had several coworkers tell me that I should never ever ever use a character array that contains \0 before the end, no matter the circumstances.
My code currently performs as expected, but I'm unsure whether it might cause problems later.
Rewriting it to avoid this behaviour would be a non-trivial chunk of work.
Using an array of char as an array of small integers is perfectly fine. Just be careful not to pass it to any kind of function that expects "strings".
And if you want to be more explicit about it, and also make sure that the array holds unsigned values, you could use uint8_t instead of char.
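A short sketch of that, reusing the 500-byte buffer from the question:
#include <stdint.h>

uint8_t tmp[500] = {0};               /* filled from the COM port elsewhere */

for (size_t i = 0; i < sizeof tmp; i++)
{
    if (tmp[i] == 250)
    {
        /* a 0 byte anywhere in tmp is just another value here,
         * because nothing ever treats tmp as a NUL-terminated string */
    }
}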

Can a C implementation use length-prefixed-strings "under the hood"?

After reading this question: What are the problems of a zero-terminated string that length-prefixed strings overcome? I started to wonder, what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap and using them as a "string prefix" to store the number N of its elements?
Then, if the N-th character is '\0', N - 1 would signify the string length.
I believe this could mightily boost performance of functions such as strlen or strcat.
This could potentially turn to extra memory consumption if a program uses non-0-terminated char arrays extensively, but that could be remedied by a compiler flag turning on or off the regular "count-until-you-reach-'\0'" routine for the compiled code.
What are possible obstacles for such an implementation? Does the C Standard allow for this? What problems can this technique cause that I haven't accounted for?
And... has this actually ever been done?
You can store the length of the allocation. And malloc implementations really do do that (or some do, at least).
You can't reasonably store the length of whatever string is stored in the allocation, though, because the user can change the contents at will; it would be unreasonable to keep the length up to date. Furthermore, users might start strings somewhere in the middle of the character array, or might not even be using the array to hold a string!
Then, if the N-th character is '\0', N - 1 would signify the string length.
Actually, no, and that's why this suggestion cannot work.
If I overwrite a character in a string with a 0, I have effectively truncated the string, and a subsequent call of strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the strtok standard library function. Amongst others.)
Moreover, it is entirely legal to call strlen on an interior byte of the string.
For example (just for demonstration purposes, although I'll bet you can find code almost identical to this in common use.)
/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer).
 * On success, returns the value part and modifies the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
    char* delim = strchr(keyval, '=');
    if (delim) {
        if (plen) *plen = strlen(delim + 1);
        *delim = 0;
        return delim + 1;
    } else {
        return NULL;
    }
}
There's nothing fundamentally stopping you from doing this in your application, if that was useful (one of the comments noted this). There are two problems that would emerge, however:
You'd have to reimplement all the string-handling functions, and have my_strlen, my_strcpy, and so on, and add string-creating functions. That might be annoying, but it's a bounded problem.
You'd have to stop people, when writing for the system, from deliberately or automatically treating the associated character arrays as ‘ordinary’ C strings and using the usual functions on them. You might have to make sure that such usages broke promptly.
This means that it would, I think, be infeasible to smuggle a reimplemented ‘C string’ into an existing program.
Something like
typedef struct {
    size_t len;
    char* buf;
} String;
size_t my_strlen(String*);
...
might work, since type-checking would frustrate (2) (unless someone decided to hack things ‘for efficiency’, in which case there's not much you can do).
Of course, you wouldn't do this unless and until you'd proven that string management was the bottleneck in your code and that this approach provably improved things....
There are a couple of issues with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you only reserve 1 byte for length, then your string can only go up to 255 characters. You can certainly use more bytes to store the length, but how many? 2? 4?
What if you try to concatenate two strings that are both at the edge of their size limits (i.e., if you use 1 byte for length and try to concatenate two 250-character strings to each other, what happens)? Do you simply add more bytes to the length as necessary?
Secondly, where do you store this metadata? It somehow has to be associated with the string. This is similar to the problem Dennis Ritchie ran into when he was implementing arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but as he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions get converted to pointer expressions in most circumstances.
You could create a new aggregate type like
struct string
{
    char *data;
    size_t len;
};
but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation would still have to support the existing interface.
You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish length bytes from content bytes, and you can't read the first character by simply dereferencing the pointer. Functions like strcat would have to know how to step around the length bytes, how to adjust the contents if the number of length bytes changes, etc.
The 0-terminated approach has its disadvantages, but it's also a helluva lot easier to implement and makes manipulating strings a lot easier.
The string methods in the standard library have defined semantics. If one generates an array of char that contains various values, and passes a pointer to the array or a portion thereof, the methods whose behavior is defined in terms of NUL bytes must search for NUL bytes in the same fashion as defined by the standard.
One could define one's own methods for string handling which use a better form of string storage, and simply pretend that the standard library string-related functions don't exist unless one must pass strings to things like fopen. The biggest difficulty with such an approach is that unless one uses non-portable compiler features it would not be possible to use in-line string literals. Instead of saying:
ns_output(my_file, "This is a test"); // ns -- new string
one would have to say something more like:
MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);
where the macro MAKE_NEW_STRING would create a union of an anonymous type, define an instance called this_is_a_test, and suitably initialize it. Since a lot of strings would be of different anonymous types, type-checking would require that strings be unions that include a member of a known array type, and code expecting strings should be given a pointer to that member, likely using something like:
#define ns_output(f,s) (ns_output_func((f),(s).stringref))
It would be possible to define the types in such a way as to avoid the need for the stringref member and have code just accept void*, but the stringref member would essentially perform static duck-typing (only things with a stringref member could be given to such a macro) and could also allow type-checking on the type of stringref itself.
If one could accept those constraints, I think one could probably write code that was more efficient in almost every way than zero-terminated strings; the question would be whether the advantages would be worth the hassle.

Prevent crash in string manipulation crashing whole application

I created a program which at regular intervals downloads a text file from a website, which is in csv format, and parses it, extracting relevant data, which then is displayed.
I have noticed that occasionally, every couple of months or so, it crashes. The crash is rare, considering the cycle of data downloading and parsing can happen every 5 minutes or even less. I am pretty sure it crashes inside the function that parses the string and extracts the data. When it crashes, it is during a congested internet connection, i.e. heavy downloads and/or a slow connection. Occasionally the remote site may be handing out corrupt or incomplete data.
I used a test application which saves the data to be processed before processing it and it indeed shows it was not complete when a crash happens.
I have adapted the function to accommodate a number of cases of invalid or incomplete data, as well as checking all return values. I also check the return values of the various functions used to connect to the remote site and download the data, and I will not go further when a return value indicates no success.
The core of the function uses strsep() to walk through the data and extract information out of it:
/*
* delimiters typically contains: <;>, <">, < >
* strsep() is used to split part of the string using delimiter
* and copy into token which then is copied into the array
* normally the function stops way before ARRAYSIZE which is just a safeguard
* it would normally stop when the end of file is reached, i.e. \0
*/
for(n=0;n<ARRAYSIZE;n++)
{
    token=strsep(&copy_of_downloaded_data, delimiters);
    if (token==NULL)
        break;
    data->array[n].example=strndup(token, strlen(token));
    if (data->array[n].example!=NULL)
    {
        token=strsep(&copy_of_downloaded_data, delimiters);
        if (token==NULL)
            break;
        (..)
        copy_of_downloaded_data=strchr(copy_of_downloaded_data,'\n'); /* find newline */
        if (copy_of_downloaded_data==NULL)
            break;
        copy_of_downloaded_data=copy_of_downloaded_data+1;
        if (*copy_of_downloaded_data=='\0') /* find end of text */
            break;
    }
}
Since I suspect I cannot account for all the ways in which the data can be corrupted, I would like to know if there is a way to program this so that the function does not crash the whole application when it encounters corrupted data.
If that is not possible, what could I do to make it more robust?
Edit: One possible instance of a crash is when the data ends abruptly, with the middle of a field cut off, i.e.
"test","example","this data is brok
At least, I noticed it by looking through the saved data; however, I found it is not consistent. I will have to stress test it, as was suggested below.
The best thing to do would be to figure out what input causes the function to crash, and fix the function so that it does not crash. Since the function is doing string processing, this should be possible to do by feeding it lots of dummy/test data (or feeding it the "right" test data if it's a particular input that causes the crash). You basically want to torture-test the function until you find out how to make it crash on demand; at that point you can start investigating exactly where and why it crashes, and once you understand that, the necessary changes to fix the crash will probably become obvious to you.
Running the program under valgrind might also point you to the bug.
If for some reason you can't fix the bug, the other option is to spawn a child process and run the buggy code inside the child process. That way if it crashes, only the child process is lost and not the parent. (You can spawn the child process under most OS's by calling fork(); you'll need to come up with some way for the child process to communicate its results back to the parent process, of course). (Note that doing it this way is a kludge and will likely not be very efficient, and could also introduce a security hole into your application if someone malicious who has the ability to send your program input can figure out how to manipulate the bug in order to take control of the child process -- so I don't recommend this approach!)
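If you do go the child-process route despite that caveat, a rough sketch (parse_data() is a placeholder for your parsing function; a real version would also need a pipe or similar so the child can return its results):
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t pid = fork();
if (pid == 0)
{
    parse_data(copy_of_downloaded_data);   /* may crash: only the child dies */
    _exit(0);
}
else if (pid > 0)
{
    int status;
    waitpid(pid, &status, 0);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        fprintf(stderr, "parser failed; skipping this download\n");
}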
What does the coredump point to?
strsep does not have memory synchronization mechanisms, so protect it as a critical section (lock around the strsep call)?
See if strsep can handle a big chunk (ARRAYSIZE is not going to help you here).
Check the stack size of the thread/program that receives copy_of_downloaded_data (I know you are only referencing it, so look at the function that receives it).
I would suggest that one should try to write code that keeps track of string lengths deliberately and doesn't care whether strings are zero-terminated or not. Even though null pointers have been termed the "billion dollar mistake"(*), I think zero-terminated strings are far worse. While there may be some situations where code using zero-terminated strings is "simpler" than code that tracks string lengths, the extra effort required to make sure that nothing can cause string-handling code to exceed buffer boundaries exceeds that required when working with known-length strings.
If, for example, one wants to store the concatenation of strings of length length1 and length2 into a buffer of length BUFF_SIZE, one can easily test whether length1+length2 <= BUFF_SIZE if one isn't expecting strings to be null-terminated, or length1+length2 < BUFF_SIZE if one expects a gratuitous null byte to follow every string. When using zero-terminated strings, one would have to determine the length of the two strings before concatenation, and having done so one could just as well use memcpy() rather than strcpy() or the useless strcat().
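For example (a sketch only; buff, s1 and s2 are placeholder names for the destination buffer and the two counted strings):
if (length1 + length2 <= BUFF_SIZE)
{
    memcpy(buff, s1, length1);                 /* no terminators involved */
    memcpy(buff + length1, s2, length2);
    /* the result's length is length1 + length2; no NUL is appended */
}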
(*) There are many situations where it's much better to have a recognizably-invalid pointer than to require that pointers which can't point to anything meaningful must instead point to something meaningless. Many null-pointer related problems actually stem from a failure of implementations to trap arithmetic with null pointers; it's not fair to blame null pointers for problems that could have been, but weren't avoided.

Disadvantages of strlen() in Embedded Domain

What are the disadvantages of using strlen()?
Sometimes in TCP communication a NUL character arrives in the middle of the string, so strlen only finds the length up to that NUL character;
we can't find the actual length of the string.
If we write an alternative to strlen, it also stops at the NUL character. So which method can I use to find the length of such a string in C?
To read from a "TCP Communication" you are probably using read. The prototype for read is
ssize_t read(int fildes, void *buf, size_t nbyte);
and the return value is the number of bytes read (even if they are 0).
So, let's say you're about to read 10 bytes, all of which are 0. You have an array with more than enough room to hold all the data:
int fildes;
char data[1000];
// fildes = TCPConnection
nbytes = read(fildes, data, 1000);
Now, by inspecting nbytes you know you have read 10 bytes. If you check data[0] through data[9] you will find they contain 0.
If the runtime library provides strcpy() and strcat(), then surely it provides strlen().
I suspect you are confusing NULL, an invalid pointer value, with the ASCII code NUL, a zero character value which indicates the end of a string to many C runtime functions.
As such, there is no concern for inserting a NUL value in a string, nor in detecting it.
Response to updated question:
Since you seem to be processing binary data, the string functions are not a good fit—unless you can guarantee there are no NULs in the stream. However, for this reason, most TCP/IP messages use headers with fields containing the number of bytes which follow.
"Embedded" strikes me as a red herring here.
If you're processing binary data where an embedded NUL might be valid, then you can't expect meaningful results from strlen.
If you're processing strings (as that term is defined in C -- a block of non-NUL data terminated by a NUL) then you can use strlen just fine.
A system being "embedded" would affect this only to the degree that it might be less common to process strings and more common to process binary data.
It is safer to use strnlen instead of strlen to avoid these problems. The strlen problems are present everywhere, not just in embedded code. Many of the string functions are dangerous because they run forever or until a zero is hit, or, like scanf or strtok, until a pattern is hit.
Remember that TCP is a stream, not packets: you may have to wait for multiple packets and piece the data together before you can attempt to call it a string anyway. That is assuming the payload is an ASCIIZ string in the first place; if it is raw data, then don't use string functions, use some other solution.
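A small sketch of the strnlen suggestion (strnlen is POSIX rather than standard C; sock and the buffer size are assumptions):
char buf[1024];
ssize_t nbytes = recv(sock, buf, sizeof buf, 0);
if (nbytes > 0)
{
    size_t len = strnlen(buf, (size_t)nbytes);   /* never scans past what was read */
    /* len < (size_t)nbytes means an embedded NUL: treat the payload as binary */
}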
Yes, strlen() uses a terminating character \0, aka NUL. Most str* functions do so. There could be a risk that data coming from files/command line/sockets would not contain this character (usually they won't: they'll be \n-terminated), but their size will also be provided by the read()/recv() function you've used. If that's a concern, you can always use a buffer slightly larger than the size you declare to those functions, e.g.
char mybuf[256+4];
ssize_t reallen = read(fd, mybuf, 256);   /* fd: the descriptor you read from */
if (reallen < 0)
    reallen = 0;
mybuf[reallen] = 0;
// we've got a 0-terminated string in mybuf.
If your data may not contain \0, compare strlen(mybuf) with reallen and terminate the session with an error code if they differ.
If your data may contain 0, then it should be processed as a buffer and not as a string. Size must be kept aside, and memcpy / memcmp functions should be used instead of strcpy and strcmp.
Also, your network protocol should be very explicit on whether strings or binary data is expected in different parts of the communication. HTTP is for instance, and it provides many way to tell the actual size of the transmitted payload.
This isn't specific to "embedded" programs, but it has become a major concern in every program to ensure that no remote code/script injection can occur. If by "embedded" you mean that you're in a non-preemptive environment and have only limited time available to perform some action, then yeah, you don't want to end up scanning 2GB of incoming bits for a (never-appearing) \0. Either the above trick or strnlen (mentioned in another answer) can be used to ensure this isn't the case.
