We have a CGI program that processes POSTed forms. Some of the POSTed text can contain non-ASCII characters, and browsers already helpfully convert these to UTF-8.
I need to "harden" the program to reject invalid strings: cases where a string containing non-ASCII bytes is not valid UTF-8 either.
I thought I'd rely on mbstowcs():
setlocale(LC_CTYPE, "en_US.UTF-8");  /* interpret multibyte sequences as UTF-8 */
unilen = mbstowcs(NULL, foo, 0);     /* NULL destination: only count/validate */
if (unilen == (size_t)-1) {
... report an error ...
}
However, I am having a hard time validating the method: it accepts valid strings all right, but I can't come up with an invalid one for it to reject...
Could someone please confirm that this is a proper way, and/or suggest an alternative?
Note that I don't care about the actual result of the conversion: once I'm confident that the string is valid UTF-8, I copy it into an e-mail (with UTF-8 charset) and let the recipient's e-mail program deal with it. The only reason I bother with the validation is to ensure the form is not used to propagate arbitrary binaries (such as viruses).
Thanks!
The function documentation says
"If an invalid multibyte character is encountered, a (size_t)-1 value is returned."
So I believe your validation is pretty much fine. Personally, I have always found this value returned for corrupted data. You might submit an arbitrary hex sequence of even length to be certain.
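For example, a quick sketch that feeds some known-invalid sequences to mbstowcs() (the test strings are mine; note that whether a given libc rejects overlong forms can vary):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");

    const char *bad[] = {
        "\xFF",          /* 0xFF can never appear in UTF-8 */
        "abc\xC3",       /* truncated two-byte sequence */
        "\xC0\xAF",      /* overlong encoding of '/' */
        "\xED\xA0\x80",  /* UTF-16 surrogate half, invalid in UTF-8 */
    };

    for (size_t i = 0; i < sizeof bad / sizeof bad[0]; i++) {
        size_t n = mbstowcs(NULL, bad[i], 0);
        printf("string %zu: %s\n", i, n == (size_t)-1 ? "rejected" : "accepted");
    }
    return 0;
}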
If you are doubtful and need further validation, GNU iconv is a good alternative:
utf-8 validation on SO
I am using the SQLite3 C interface. After reading the documentation at https://www.sqlite.org/c3ref/bind_blob.html , I am totally confused.
What is the difference between sqlite3_bind_text, sqlite3_bind_text16 and sqlite3_bind_text64?
The document only describes that sqlite3_bind_text64 can accept an encoding parameter: SQLITE_UTF8, SQLITE_UTF16, SQLITE_UTF16BE, or SQLITE_UTF16LE.
So I guess, based on the parameters passed to these functions, that:
sqlite3_bind_text is for ANSI characters, char *
sqlite3_bind_text16 is for UTF-16 characters,
sqlite3_bind_text64 is for the various encodings mentioned above.
Is that correct?
One more question:
The document says, "If the fourth parameter to sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length of the string is the number of bytes up to the first zero terminator." But it does not say what will happen with sqlite3_bind_text64. Originally I thought this was a typo. However, when I pass -1 as the fourth parameter to sqlite3_bind_text64, I always get an SQLITE_TOOBIG error, which makes me think they left sqlite3_bind_text64 out of the above statement on purpose. Is that correct?
Thanks
sqlite3_bind_text() is for UTF-8 strings.
sqlite3_bind_text16() is for UTF-16 strings using your processor's native endianness.
sqlite3_bind_text64() lets you specify a particular encoding (UTF-8, native UTF-16, or UTF-16 with a specific endianness). You'll probably never need it.
sqlite3_bind_blob() should be used for non-Unicode strings that are just treated as binary blobs; all sqlite string functions work only with Unicode.
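As a rough sketch of the common case (statement and parameter index are hypothetical): note that sqlite3_bind_text64() takes an unsigned sqlite3_uint64 length, so passing -1 wraps around to a huge value, which would explain the SQLITE_TOOBIG in the question; pass the length explicitly instead.

#include <sqlite3.h>
#include <string.h>

/* Bind a UTF-8 string to parameter 1 of a prepared statement. */
int bind_body(sqlite3_stmt *stmt, const char *utf8)
{
    /* With sqlite3_bind_text() a negative length means "NUL-terminated":
       sqlite3_bind_text(stmt, 1, utf8, -1, SQLITE_TRANSIENT);
       With the 64-bit variant, compute the length yourself: */
    return sqlite3_bind_text64(stmt, 1, utf8,
                               (sqlite3_uint64)strlen(utf8),
                               SQLITE_TRANSIENT, SQLITE_UTF8);
}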
Okay basically what I'm asking is:
Let's say I use PathFindFileNameA on a path containing Unicode characters. I obtain this path via GetModuleFileNameA, but since this API doesn't support Unicode characters outside the current code page (Italian characters, for example), it will output junk characters in that part of the path.
Let's assume x represents a junk character in the file path, such as:
C:\Users\xxxxxxx\Desktop\myfile.sys
I assume that PathFindFileNameA just parses the string with strtok until it encounters the last \\, and outputs the remainder into a preallocated buffer, given str_length - pos_of_last \\.
The question is, will PathFindFileNameA parse the string correctly even if it encounters the junk characters from a failed Unicode conversion (since the multi-byte counterpart of the API is called), or will it crash the program?
Don't answer something like "Well just use MultiByteToWideChar", or "Just use a wide-version of the API". I am asking a specific question, and a specific answer would be appreciated.
Thanks!
Why do you think the Windows API just does strtok? As I recall, all the ...A APIs have been redirected to the ...W APIs since before Windows 10 was released.
And I think the answer to this question is quite simple: just write a small test program, set the code page to whatever you want, run it, and see what happens.
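For instance, a minimal sketch of such a test (the junk byte values are just hypothetical stand-ins for a failed conversion):

#include <stdio.h>
#include <windows.h>
#include <shlwapi.h>  /* link with Shlwapi.lib */

int main(void)
{
    /* Simulate junk bytes where the Unicode characters failed to convert */
    char path[] = "C:\\Users\\\x8f\x8f\x8f\\Desktop\\myfile.sys";

    char *name = PathFindFileNameA(path);
    printf("%s\n", name);  /* expected: myfile.sys, without crashing */
    return 0;
}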
P.S.: Personally, I think GetModuleFileNameA will work correctly even if there are junk characters, because Windows stores the image name as a UNICODE_STRING internally. And even if you use MBCS, the junk bytes do not contain zero bytes, so it will work as usual, since it's essentially just doing a strncpy.
Sorry for my last answer :)
Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that, depending on the encoding the file was saved as, only some of these work. wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working, I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops working (I get ? instead of each character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 big-endian mode (Unicode - codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to the file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. So what is a portable and compiler-independent way to store strings? Is there a preprocessor which would convert one string representation into another before compilation, so I could store the file in UTF-8 and only preprocess the strings which I need to have in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that, depending on the encoding the file was saved as, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine using German umlauts etc.
If I save the file as UTF-8 without signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version working, I need to increase the character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded as UTF-16. I assume that, due to the missing BOM, the compiler interprets the whole thing as a narrow string and therefore converts the UTF-8 bytes incorrectly.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops working (I get ? instead of each character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler now knows the file is in UTF-8, not ANSI (or some codepage). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper codepage and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to properly read and interpret the file (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and expecting it to work properly no matter whether other programs try to read it as a grayscale, palettized, or full-color image. This won't work (even if the differences here are smaller).
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16 strings into the compiled output! So you have to decide for one source code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use the L prefix);
7-bit ASCII characters are not involved and can be used with or without the L prefix.
The only portable and compiler-independent way is to use the ASCII charset and escape sequences, because there is no guarantee that any compiler will accept a UTF-8 encoded file, and compilers' treatment of those multibyte sequences may vary.
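For example, a sketch of what that looks like for the strings above (the bytes spell "Ελληνικά" in UTF-8, and the \u escapes are the corresponding code points):

/* UTF-8 bytes written out explicitly: the source file stays pure ASCII */
const char utf8[] = "a\xCE\x95\xCE\xBB\xCE\xBB\xCE\xB7\xCE\xBD\xCE\xB9\xCE\xBA\xCE\xAC\n";

/* UTF-16 code units via universal character names (C99) */
const wchar_t utf16[] = L"w\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC\n";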
I have vague memories of suggestions that sscanf was bad. I know it won't overflow buffers if I use the field width specifier, so is my memory just playing tricks with me?
I think it depends on how you're using it: if you're scanning for something like an int, it's fine. If you're scanning for a string, it's not (unless there's a width field I'm forgetting?).
Edit:
It's not always safe for scanning strings.
If your buffer size is a constant, then you can certainly specify it as something like %20s. But if it's not a constant, you need to specify it in the format string, and you'd need to do:
char format[80];  /* make sure this is big enough... kinda painful */
sprintf(format, "%%%ds", cchBuffer - 1);  /* don't miss the percent signs and the - 1! */
sscanf(input, format, buffer);  /* good luck */
which is possible but very easy to get wrong, like I did in my previous edit (forgot to take care of the null-terminator). You might even overflow the format string buffer.
The reason sscanf might be considered bad is that it doesn't require you to specify a maximum string width for string arguments, which can result in overflows if the input read from the source string is longer. So the precise answer is: it is safe if you specify widths properly in the format string, otherwise not.
Note that as long as your buffers are at least as long as strlen(input_string)+1, there is no way the %s or %[ specifiers can overflow. You can also use field widths in the specifiers if you want to enforce stricter limits, or you can use %*s and %*[ to suppress assignment and instead use %n before and after to get the offsets in the original string, and then use those to read the resulting sub-string in-place from the input string.
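A rough sketch of that %n technique (names are made up):

#include <stdio.h>

int main(void)
{
    const char *input = "alpha beta";
    int start = 0, end = 0;

    /* %*s consumes one word without storing it; the two %n record the
       offsets just before and after it. */
    if (sscanf(input, " %n%*s%n", &start, &end) != EOF && end > start)
        printf("token: %.*s\n", end - start, input + start);  /* in-place, no copy */
    return 0;
}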
Yes, it is... if you specify the string width, so there are no buffer-overflow-related problems.
Anyway, as @Mehrdad showed us, there can be problems if the buffer size isn't established at compile time. I suppose that putting a limit on the length of the strings that can be supplied to sscanf could eliminate the problem.
All of the scanf functions have fundamental design flaws, only some of which could be fixed. They should not be used in production code.
Numeric conversion has full-on demons-fly-out-of-your-nose undefined behavior if a value overflows the representable range of the variable you're storing the value in. I am not making this up. The C library is allowed to crash your program just because somebody typed too many input digits. Even if it doesn't crash, it's not obliged to do anything sensible. There is no workaround.
As pointed out in several other answers, %s is just as dangerous as the infamous gets. It's possible to avoid this by using either the 'm' modifier, or a field width, but you have to remember to do that for every single text field you want to convert, and you have to wire the field widths into the format string -- you can't pass sizeof(buff) as an argument.
If the input does not exactly match the format string, sscanf doesn't tell you how many characters into the input buffer it got before it gave up. This means the only practical error-recovery policy is to discard the entire input buffer. This can be OK if you are processing a file that's a simple linear array of records of some sort (e.g. with a CSV file, "skip the malformed line and go on to the next one" is a sensible error recovery policy), but if the input has any more structure than that, you're hosed.
In C, parse jobs that aren't complicated enough to justify using lex and yacc are generally best done either with POSIX regexps (regex.h) or with hand-rolled string parsing. The strto* numeric conversion functions do have well-specified and useful behavior on overflow and do tell you how many characters of input they consumed, and string.h has lots of handy functions for hand-rolled parsers (strchr, strcspn, strsep, etc.).
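For instance, a sketch of the overflow handling strtol gives you that sscanf cannot (the input string is arbitrary):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *input = "99999999999999999999x";
    char *end;

    errno = 0;
    long value = strtol(input, &end, 10);

    if (end == input)
        puts("no digits at all");
    else if (errno == ERANGE)
        puts("overflow: value clamped to LONG_MIN/LONG_MAX, errno set");
    else
        printf("parsed %ld, stopped at \"%s\"\n", value, end);
    return 0;
}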
There are two points to take care of.
The output buffer(s).
As mentioned by others, if you specify a size smaller than or equal to the output buffer size in the format string, you are safe.
The input buffer.
Here you need to make sure that it is a null-terminated string, or that you will not read more than the input buffer size.
If the input string is not null-terminated, sscanf may read past the boundary of the buffer and crash if the memory is not allocated.
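A small sketch of guarding the input side (buffer sizes are arbitrary): fgets always null-terminates, so the sscanf that follows cannot run off the end.

#include <stdio.h>

int main(void)
{
    char line[256];
    char word[32];

    /* fgets reads at most sizeof line - 1 bytes and always NUL-terminates */
    if (fgets(line, sizeof line, stdin) != NULL) {
        /* width 31 leaves room for the terminator in word[32] */
        if (sscanf(line, "%31s", word) == 1)
            printf("first word: %s\n", word);
    }
    return 0;
}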
Can we use __typeof__ for input validation in a C program running on a Linux platform, and if so, how?
If we can't, then are there any ways other than regexes to achieve the same?
"typeof" is purely a compile-time directive. It cannot be used for "input validation."
Input validation rules can be complex. In C, they are made more complex by the fact that the tools you have at your disposal in the standard C library are pretty awful. One example is atoi(), which returns 0 if the string you pass in doesn't contain a number at the beginning (atoi("hello world") == 0), while "1337isanumber" will actually return 1337. To simply validate that something is a number, you could (assuming ASCII and not Unicode) use a loop and check that each character up until the first null terminator (or the end of the memory you allocated for the string) is in fact a digit. A similar procedure could be done to check if something is alphanumeric, etc. As you mentioned, regexes can be used for a telephone number or some other relatively complex data format.
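A minimal sketch of that digit-checking loop (the helper name is made up):

#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: true only if s is a non-empty run of decimal digits. */
bool is_all_digits(const char *s)
{
    if (*s == '\0')
        return false;               /* empty string is not a number */
    for (; *s != '\0'; s++)
        if (!isdigit((unsigned char)*s))
            return false;           /* cast avoids UB for high-bit chars */
    return true;
}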
Your comment below references using "instanceof" in Java for input validation, but this isn't possible either. If you get user input from, say, the command line, or a query string parameter, or whatever, it really comes in as a string. If you're using a Scanner object to scan the standard input and use a method such as nextInt(), it's really converting a string (from the stream) into something, which can throw a runtime exception. You cannot use instanceof to determine a string's contents; a String is a String -- even if its contents are "42", it is not an instance of an Integer!