Is it safe to concatenate formatted strings using sprintf()? - c

I am writing a program that requires a gradual building of a formatted string, to be printed out as the last stage. The string includes numbers that are collected while the string is formed. Thus, I need to add formatted string fragments to the output string.
One straightforward way is to use sprintf() to write the formatted fragment to a temporary string, which is then concatenated to the output string using strcat(), as demonstrated in this answer.
A more sophisticated approach is to point sprintf() to the end of the current output string, when adding the new fragment. This is demonstrated here.
The help page to the MSVC sprintf_s() function (and the other variants of sprintf()) states that:
If copying occurs between strings that overlap, the behavior is
undefined.
Now, technically, using sprintf() to concatenate the fragment to the end of the output string means overwriting the terminating null character, which is considered a part of the first string. So, this action falls under the category of overlapping strings. The technique seems to work well, but is it really safe?

The method in the answer you linked to:
strcat() for formatted strings
is safe against overlapping string issues, but of course it's unsafe in that it's performing unbounded writes using sprintf rather than snprintf.
What's not safe, and what the text about overlapping strings is referring to, is something like:
snprintf(buf, sizeof buf, "%s%s", buf, tail);
Here, overlapping ranges of buf are being used both as an input and an output, and this results in undefined behavior. Don't do it.
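For contrast, here is a minimal sketch of a bounded append that writes at the end of the existing string without passing the buffer as one of its own inputs. The helper name append_fmt and its return convention are assumptions made for illustration, not something from the linked answers.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: appends formatted text at buf + *len, where *len is
 * the current string length. Returns 0 on success, -1 if the fragment did
 * not fit (or on an encoding error), in which case the old string is kept. */
int append_fmt(char *buf, size_t cap, size_t *len, const char *fmt, ...)
{
    if (*len >= cap)
        return -1;

    va_list ap;
    va_start(ap, fmt);
    /* Writing starts at the old terminator. As long as no argument points
     * into buf, source and destination do not overlap. */
    int n = vsnprintf(buf + *len, cap - *len, fmt, ap);
    va_end(ap);

    if (n < 0 || (size_t)n >= cap - *len) {
        buf[*len] = '\0';   /* undo a truncated write */
        return -1;
    }
    *len += (size_t)n;
    return 0;
}
```

A caller would start with an empty buffer and a length of 0, then call append_fmt(buf, sizeof buf, &len, "x=%d", 42) once per fragment. Tracking the length also avoids rescanning the string with strlen() on every append.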

Related

Splitting input lines at a comma

I am reading the contents of a file into a 2D array. The file is of the type:
FirstName,Surname
FirstName,Surname
etc. This is a homework exercise, and we can assume that everyone has a first name and a surname.
How would I go about splitting the line using the comma so that in a 2D array it would look like this:
char name[100][2];
with
Column1 Column2
Row 0 FirstName Surname
Row 1 FirstName Surname
I am really struggling with this and couldn't find any help that I could understand.
You can use strtok to tokenize your string based on a delimiter, and then strcpy the pointer to the token returned into your name array.
Alternatively, you could use strchr to find the location of the comma, and then use memcpy to copy the parts of the string before and after this point into your name array. This way will also preserve your initial string and not mangle it the way strtok would. It'll also be more thread-safe than using strtok.
Note: a thread-safe alternative to strtok is strtok_r, however that's declared as part of the POSIX standard. If that function's not available to you there may be a similar one defined for your environment.
EDIT: Another way is to use sscanf. However, you won't be able to use the %s format specifier for the first string; you'd instead have to use a specifier with a set of characters to not match against (','). Since it's homework (and really simple), I'll let you figure that out.
EDIT2: Also, your array should be char name[2][100] for an array of two strings, each of 100 chars in size. Otherwise, with the way you have it, you'll have an array of 100 strings, each of 2 chars in size.
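The strchr and memcpy approach mentioned above could be sketched roughly as follows. The helper name split_name and the FIELD_SIZE limit are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

#define FIELD_SIZE 100  /* assumed size of each field, terminator included */

/* Hypothetical helper: copies the text before the first comma into first
 * and the text after it into last. Returns 0 on success, -1 if there is no
 * comma or a part does not fit. The input line is not modified. */
int split_name(const char *line, char first[FIELD_SIZE], char last[FIELD_SIZE])
{
    const char *comma = strchr(line, ',');
    if (comma == NULL)
        return -1;

    size_t first_len = (size_t)(comma - line);
    size_t last_len = strcspn(comma + 1, "\r\n");  /* stop before a newline */
    if (first_len >= FIELD_SIZE || last_len >= FIELD_SIZE)
        return -1;

    memcpy(first, line, first_len);
    first[first_len] = '\0';
    memcpy(last, comma + 1, last_len);
    last[last_len] = '\0';
    return 0;
}
```

To hold many rows, one option is a three-dimensional array such as char names[100][2][FIELD_SIZE], filled with split_name(line, names[i][0], names[i][1]) for each line read.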

Is sscanf considered safe to use?

I have vague memories of suggestions that sscanf was bad. I know it won't overflow buffers if I use the field width specifier, so is my memory just playing tricks with me?
I think it depends on how you're using it: If you're scanning for something like int, it's fine. If you're scanning for a string, it's not (unless there was a width field I'm forgetting?).
Edit:
It's not always safe for scanning strings.
If your buffer size is a constant, then you can certainly specify it as something like %20s. But if it's not a constant, you need to specify it in the format string, and you'd need to do:
char format[80]; // Make sure this is big enough... kinda painful
sprintf(format, "%%%ds", cchBuffer - 1); // Don't miss the percent signs and the - 1!
sscanf(input, format, buffer); // input first, then the format; buffer is the destination of cchBuffer chars. Good luck
which is possible but very easy to get wrong, like I did in my previous edit (forgot to take care of the null-terminator). You might even overflow the format string buffer.
The reason sscanf might be considered bad is that it doesn't require you to specify a maximum width for string arguments, which can result in overflows if the text read from the source string is longer than the destination. So the precise answer is: it is safe if you specify widths properly in the format string, and otherwise not.
Note that as long as your buffers are at least as long as strlen(input_string)+1, there is no way the %s or %[ specifiers can overflow. You can also use field widths in the specifiers if you want to enforce stricter limits, or you can use %*s and %*[ to suppress assignment and instead use %n before and after to get the offsets in the original string, and then use those to read the resulting sub-string in-place from the input string.
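The %n technique described above might look like the following sketch, which locates the first word inside the input without copying it anywhere. The helper name first_word is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: finds the first whitespace-delimited word in input
 * without copying it. On success stores its offset and length and returns 0;
 * returns -1 if input contains no word. */
int first_word(const char *input, int *start, int *len)
{
    int begin = -1, end = -1;

    /* The leading space skips whitespace, %n records the offset, %*s scans
     * a word without storing it, and the second %n records where it ended.
     * If no word is found, the second %n is never reached. */
    sscanf(input, " %n%*s%n", &begin, &end);
    if (end < 0)
        return -1;

    *start = begin;
    *len = end - begin;
    return 0;
}
```

Because nothing is written to a destination buffer, there is nothing to overflow; the caller reads the word in place as input + start with the given length. The return value of sscanf is not useful here, since suppressed conversions and %n do not count as assignments.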
Yes, it is, provided you specify the string width so that there are no buffer-overflow problems.
Anyway, as @Mehrdad showed us, there can be problems if the buffer size isn't established at compile time. I suppose that putting a limit on the length of the string that can be supplied to sscanf could eliminate the problem.
All of the scanf functions have fundamental design flaws, only some of which could be fixed. They should not be used in production code.
Numeric conversion has full-on demons-fly-out-of-your-nose undefined behavior if a value overflows the representable range of the variable you're storing the value in. I am not making this up. The C library is allowed to crash your program just because somebody typed too many input digits. Even if it doesn't crash, it's not obliged to do anything sensible. There is no workaround.
As pointed out in several other answers, %s is just as dangerous as the infamous gets. It's possible to avoid this by using either the 'm' modifier, or a field width, but you have to remember to do that for every single text field you want to convert, and you have to wire the field widths into the format string -- you can't pass sizeof(buff) as an argument.
If the input does not exactly match the format string, sscanf doesn't tell you how many characters into the input buffer it got before it gave up. This means the only practical error-recovery policy is to discard the entire input buffer. This can be OK if you are processing a file that's a simple linear array of records of some sort (e.g. with a CSV file, "skip the malformed line and go on to the next one" is a sensible error recovery policy), but if the input has any more structure than that, you're hosed.
In C, parse jobs that aren't complicated enough to justify using lex and yacc are generally best done either with POSIX regexps (regex.h) or with hand-rolled string parsing. The strto* numeric conversion functions do have well-specified and useful behavior on overflow and do tell you how many characters of input they consumed, and string.h has lots of handy functions for hand-rolled parsers (strchr, strcspn, strsep, etc).
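As a rough illustration of the strto* behavior described above, here is a sketch that parses an int with strtol, rejects overflow, and reports where parsing stopped. The helper name parse_int and its return convention are assumptions.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parses a decimal int at the start of s. On success
 * stores the value, points *rest just past the digits, and returns 0.
 * Returns -1 if no digits were found or the value does not fit in an int. */
int parse_int(const char *s, int *value, const char **rest)
{
    char *end;

    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s)
        return -1;                       /* no conversion performed */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;                       /* out of range, but well defined */

    *value = (int)v;
    *rest = end;
    return 0;
}
```

The rest pointer lets a hand-rolled parser continue from the exact character where the number ended, which is the information sscanf does not readily provide.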
There are two points to take care of.
The output buffer(s).
As mentioned by others, you are safe if the field width you specify in the format string is smaller than the output buffer size, leaving room for the terminating null character.
The input buffer.
Here you need to make sure that it is a null-terminated string, or that you will not read more than the input buffer size.
If the input string is not null-terminated, sscanf may read past the boundary of the buffer and crash if the memory is not allocated.

Reading user input and checking the string

How does one check the read in string for a substring in C?
If I have the following
char name[21];
fgets(name, 21, stdin);
How do I check the string for a series of substrings?
How does one check for a substring before a character? For example, how would one check for a substring before an = sign?
Be wary of strtok(); it is not re-entrant. Amongst other things, it means that if you need to call it in one function, and then call another function, and if that other function also uses strtok(), your first function is messed up. It also writes NUL ('\0') bytes over the separators, so it modifies the input string as it goes. If you are looking for more than one terminator character, you can't tell which one was found. Further, if you write a library function for others to use, yet your function uses strtok(), you must document the fact so that callers of your function are not bemused by the failures of their own code that uses strtok() after calling your function. In other words, it is poisonous; if your function calls strtok(), it makes your function unreusable, in general; similarly, your code that uses strtok() cannot call other people's functions that also use it.
If you still like the idea of the functionality - some people do (but I almost invariably avoid it) - then look for strtok_r() on your system. It is re-entrant; it takes an extra parameter which means that other functions can use strtok_r() (or strtok()) without affecting your function.
There are a variety of alternatives that might be appropriate. The obvious ones to consider are strchr(), strrchr(), strpbrk(), strspn(), strcspn(): none of these modify the strings they analyze. All are part of Standard C (as is strtok()), so they are essentially available everywhere. Looking for the material before a single character suggests that you should use strchr().
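A minimal sketch of the strchr() suggestion for checking the text before an = sign, without modifying the input. The helper name key_matches is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the text before the first '=' in line
 * is exactly key, 0 otherwise (including when there is no '='). */
int key_matches(const char *line, const char *key)
{
    const char *eq = strchr(line, '=');
    if (eq == NULL)
        return 0;

    size_t key_len = (size_t)(eq - line);
    return strlen(key) == key_len && memcmp(line, key, key_len) == 0;
}
```

For a buffer filled by fgets(), key_matches(name, "user") would report whether the line starts with user=. To test for a substring anywhere in the line, strstr() is the usual choice.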
Use strtok() to split the string into tokens.
char *pch;
pch = strtok (name,"=");
if (pch != NULL)
{
printf ("Substring: %s\n",pch);
}
You can keep calling strtok() to find more strings after the =.
You can use strtok but it's not reentrant and it destroys the original string. Other (perhaps safer) functions to look into would be strchr, strstr, strspn, and perhaps the mem* variations. In general, I avoid strn* variants because, while they do "bounds checking," they still rely on the nul terminator. They can fail on a valid string that just happens to be longer than you expected to deal with, and they won't actually prevent a buffer overrun unless you know the buffer size. Better (IMHO) to ignore the terminator and know exactly how much data you're working with every time, the way the mem* functions work.

The terminating NULL in an array in C

I have a simple question. Why is it necessary to consider the terminating null in an array of chars (or simply a string) and not in an array of integers? When I want a string to hold 20 characters, I need to declare char string[21];. When I want to declare an array of integers holding 5 digits, then int digits[5]; is enough. What is the reason for this?
You don't have to terminate a char array with NULL if you don't want to, but when using them to represent a string, then you need to do it because C uses null-terminated strings to represent its strings. When you use functions that operate on strings (like strlen for string-length or using printf to output a string), then those functions will read through the data until a NULL is encountered. If one isn't present, then you would likely run into buffer overflow or similar access violation/segmentation fault problems.
In short: that's how C represents string data.
Null terminators are required at the end of strings (or character arrays) because:
Most standard library string functions expect the null character to be there. It's put there in lieu of passing an explicit string length (though some functions require that instead.)
By design, the NUL character (ASCII 0x00) is used to designate the end of strings. It is not an end-of-file marker, though: EOF is a separate negative int value returned by functions such as getchar().
Technically, if you're doing your own string manipulation with your own coded functions, you don't need a null terminator; you just need to keep track of how long the string is. But, if you use just about anything standardized, it will expect it.
It is only by convention that C strings end in the ascii nul character. (That's actually something different than NULL.)
If you like, you can begin your strings with a nul byte, or randomly include nul bytes in the middle of strings. You will then need your own library.
So the answer is: all arrays must allocate space for all of their elements. Your "20 character string" is simply a 21-character string, including the nul byte.
The reason is that it was a design choice of the original implementors. A null-terminated string gives you a way to pass an array into a function without passing the size. With an integer array you must always pass the size. It's a convention of the language, nothing more: you could rewrite every string function in C without using a null terminator, but you would always have to keep track of your array size.
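The contrast between a terminated string and an integer array with an explicit length can be sketched as follows. The names text, digits, count_chars and sum_ints are made up for this illustration.

```c
#include <stddef.h>

/* Twenty visible characters plus the terminating '\0' need 21 elements. */
char text[21] = "abcdefghijklmnopqrst";

/* Five ints need exactly five elements; nothing marks the end. */
int digits[5] = {1, 2, 3, 4, 5};

/* No length parameter: the '\0' tells the loop where to stop. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* No terminator: the caller has to pass the element count. */
int sum_ints(const int *a, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

Here sizeof text is 21 while count_chars(text) is 20, which is the one-element difference the question asks about.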
The purpose of null termination in strings is so that the parser knows when to stop iterating through the array of characters.
So, when you use printf with the %s format character, it's essentially doing this:
int i = 0;
while(input[i] != '\0') {
output(input[i]);
i++;
}
This concept is commonly known as a sentinel.
It's not about declaring an array that's one-bigger, it's really about how we choose to define strings in C.
C strings by convention are considered to be a series of characters terminated by a final NUL character, as you know. This is baked into the language in the form of interpreting "string literals", and is adopted by all the standard library functions like strcpy and printf and etc. Everyone agrees that this is how we'll do strings in C, and that character is there to tell those functions where the string stops.
Looking at your question the other way around, the reason you don't do something similar in your arrays of integers is because you have some other way of knowing how long the array is-- either you pass around a length with it, or it has some assumed size. Strings could work this way in C, or have some other structure to them, but they don't -- the guys at Bell Labs decided that "strings" would be a standard array of characters, but would always have the terminating NUL so you'd know where it ended. (This was a good tradeoff at that time.)
It's not absolutely necessary to have the character array be 21 elements. It's only necessary if you follow the (nearly always assumed) convention that the twenty characters be followed by a null terminator. There is usually no such convention for a terminator in integer and other arrays.
Because of the technical reasons of how C strings are implemented compared to other conventions.
Actually - you don't have to NUL-terminate your strings if you don't want to!
The only problem is you have to re-write all the string libraries because they depend on them. It's just a matter of doing it the way the library expects if you want to use their functionality.
Just like I have to bring home your daughter at midnight if I wish to date her - just an agreement with the library (or in this case, the father).

differences between memchr() and strchr()

What is the actual difference between memchr() and strchr(), besides the extra parameter? When do you use one or the other one? and would there be a better outcome performance replacing strchr() by memchr() if parsing a big file (theoretically speaking)?
strchr stops when it hits a null character but memchr does not; this is why the former does not need a length parameter but the latter does.
Functionally there is no difference in that they both scan an array / pointer for a provided value. The memchr version just takes an extra parameter because it needs to know the length of the provided pointer. The strchr version can avoid this because it can use strlen to calculate the length of the string.
Differences can pop up if you attempt to use strchr on a char* that stores binary data, as it potentially won't see the full length of the data. This is true of pretty much any char* with binary data and a str* function. For non-binary data, though, they are virtually the same function.
You can actually code up strchr in terms of memchr fairly easily
#include <string.h>
char *my_strchr(const char *pStr, int value) { /* renamed so it does not clash with the library strchr */
return (char *)memchr(pStr, value, strlen(pStr) + 1);
}
The +1 is necessary here because strchr can be used to find the null terminator in the string. This is definitely not an optimal implementation because it walks the memory twice. But it does serve to demonstrate how close the two are in functionality.
strchr expects that the first parameter is null-terminated, and hence doesn't require a length parameter.
memchr works similarly but doesn't expect that the memory block is null-terminated, so you may be searching for a \0 character successfully.
No real difference, just that strchr() assumes it is looking through a null-terminated string (so that determines the size).
memchr() simply looks for the given value up to the size passed in.
In practical terms, there's not much difference. Also, implementations are free to make one function faster than the other.
The real difference comes from context. If you're dealing with strings, then use strchr(). If you have a finite-size, non-terminated buffer, then use memchr(). If you want to search a finite-size subset of a string, then use memchr().
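To make the contextual difference concrete, here is a small sketch comparing the two on data that contains an embedded null byte. The helper names find_in_buffer and find_in_string are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of byte c in the first n bytes of buf, or -1.
 * Uses memchr, so embedded '\0' bytes do not end the search. */
long find_in_buffer(const void *buf, size_t n, int c)
{
    const unsigned char *p = memchr(buf, c, n);
    return p ? (long)(p - (const unsigned char *)buf) : -1;
}

/* Hypothetical helper: offset of c in the string s, or -1.
 * Uses strchr, so the search ends at the first '\0'. */
long find_in_string(const char *s, int c)
{
    const char *p = strchr(s, c);
    return p ? (long)(p - s) : -1;
}
```

For the five bytes 'a', 'b', '\0', 'c', 'd', find_in_buffer with a length of 5 locates 'd' at offset 4, while find_in_string stops at the null byte and reports that 'd' is absent.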

Resources