How to deal with Unicode paths in a cross-platform C library?

I'm contributing to a C library. It has a function that takes a char* parameter for a file path name. The authors are mostly UNIX developers, and this works fine on unixes where char* mostly means UTF-8. (At least in GCC, the character set is configurable and UTF-8 is the default.)
However, char* means ANSI on Windows, which implies that it is currently impossible to use Unicode path names with this library on Windows, where wchar_t* should be used and only UTF-16 is supported. (A quick search on StackOverflow reveals that the ANSI Windows API functions cannot be used with UTF-8.)
The question is, what is the right way to deal with this? We've come up with various ways to do it, but none of us are Windows experts, so we can't really decide how to do it properly. Our goal is that users of the library should be able to write cross-platform code that would work on unixes as well as Windows.
Under the hood, the library has #ifdefs in place to differentiate between operating systems so that it can use POSIX functions on UNIXes and Win32 APIs on Windows.
So far, we've come up with the following possibilities:
1. Offer a separate Windows-only function that accepts a wchar_t*.
2. Require UTF-16 on Windows and #ifdef the library header in such a way that the function would accept wchar_t* on Windows.
3. Add a flag that would tell the function to cast the given char* to wchar_t* and call the wide-character Windows APIs.
4. Create a variant of the function that takes a file descriptor (or file handle on Windows) instead of a file path.
5. Always require UTF-8 (even on Windows), and then inside the function, convert UTF-8 to UTF-16 and call the wide-character Windows APIs.
The problem with options 1-4 is that they would require the user to consciously take care of portability themselves. Option 5 sounds good, but I'm not sure if this is the right way to go.
I'm also open to other suggestions or ideas that can solve this. :)

Since portability is an important goal for you, I think it is imperative for your function semantics to be precisely defined. Among other things, that means that the arguments' types and meanings don't vary across platforms. So, if you have a function that accepts regular char based paths then it should accept such paths on all systems, and the encoding expected of those paths should be well-defined (which does not necessarily mean "the same"). That rules out options (2) and (3).
Moreover, portability requires the same functions to be usable across all platforms; that rules out (1). Option (4) could be ok if a stream- and/or file descriptor-based approach were the only one provided by your library, but it yields portability only with respect to those functions, not with respect to the path-based ones. (And note that stream (FILE *) APIs are defined by C, whereas file descriptors are a POSIX concept, not native to C. In principle, therefore, streams are more portable than file descriptors.)
(5) could work, but it places stronger constraints than you actually need. It is not essential for the function to define the encoding expected (though it can); it suffices for it to define how that encoding is determined.
Additionally, you could add wchar_t-based functions that work everywhere (as opposed to Windows-only). Those might be more convenient for Windows users. Similar to alternative (4), however, that provides portability only with respect to those functions. Supposing that you don't want to drop the char-based ones, you would need to pair this alternative with some variation on (5).
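To make (5) concrete, here is a minimal sketch of such a function; lib_fopen and utf8_to_wide are made-up names, not part of any existing library:

#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Convert NUL-terminated UTF-8 to a malloc'd UTF-16 string. */
static wchar_t *utf8_to_wide(const char *s)
{
    /* First call computes the required size in wchar_t units, including the NUL. */
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (len == 0)
        return NULL;                           /* invalid UTF-8 */
    wchar_t *w = malloc(len * sizeof *w);
    if (w)
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, len);
    return w;
}

FILE *lib_fopen(const char *utf8_path, const char *utf8_mode)
{
    wchar_t *wpath = utf8_to_wide(utf8_path);
    wchar_t *wmode = utf8_to_wide(utf8_mode);
    FILE *f = (wpath && wmode) ? _wfopen(wpath, wmode) : NULL;
    free(wpath);
    free(wmode);
    return f;
}
#else
FILE *lib_fopen(const char *utf8_path, const char *utf8_mode)
{
    /* On POSIX systems the UTF-8 bytes are passed through unchanged. */
    return fopen(utf8_path, utf8_mode);
}
#endif

Users then pass UTF-8 unconditionally, and the #ifdefs your library already has keep the conversion out of sight.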

Related

Why does system() exist?

Many papers and such mention that calls to 'system()' are unsafe and unportable. I do not dispute their arguments.
I have noticed, though, that many Unix utilities have a C library equivalent. If not, the source is available for a wide variety of these tools.
While many papers and such recommend against goto, there are those who can make an argument for its use, and there are simple reasons why it's in C at all.
So, why do we need system()? How much existing code relies on it that can't easily be changed?
Sarcastic answer: because if it didn't exist, people would ask why that functionality didn't exist...
Better answer:
Much of this system functionality is not part of the C standard, but is part of, say, the Linux spec, and Windows most likely has some equivalent. So if you're writing an app that will only be used in Linux environments, using these functions is not an issue, and they are actually useful. If you're writing an application that can run on both Linux and Windows (or others), these calls become problematic because they may not be portable between systems. The key (imo) is that you are simply aware of the issues/concerns and program accordingly (e.g. use appropriate #ifdefs to protect the code, as in the sketch below).
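For example, such a guard might look like this (a minimal sketch, assuming a simple POSIX-vs-Windows split):

#include <stdlib.h>

void list_directory(void)
{
#ifdef _WIN32
    system("dir");       /* Windows command processor */
#else
    system("ls -l");     /* POSIX shell */
#endif
}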
The closest thing to an official "why" answer you're likely to find is the C89 Rationale. 4.10.4.5 The system function reads:
The system function allows a program to suspend its execution temporarily in order to run another program to completion.
Information may be passed to the called program in three ways: through command-line argument strings, through the environment, and (most portably) through data files. Before calling the system function, the calling program should close all such data files.
Information may be returned from the called program in two ways: through the implementation-defined return value (in many implementations, the termination status code which is the argument to the exit function is returned by the implementation to the caller as the value returned by the system function), and (most portably) through data files.
If the environment is interactive, information may also be exchanged with users of interactive devices.
Some implementations offer built-in programs called "commands" (for example, date) which may provide useful information to an application program via the system function. The Standard does not attempt to characterize such commands, and their use is not portable.
On the other hand, the use of the system function is portable, provided the implementation supports the capability. The Standard permits the application to ascertain this by calling the system function with a null pointer argument. Whether more levels of nesting are supported can also be ascertained this way; assuming more than one such level is obviously dangerous.
Aside from that, I would say mainly for historical reasons. In the early days of Unix and C, system was a convenient library function that fulfilled a need several interactive programs had: as mentioned above, "suspend[ing] its execution temporarily in order to run another program". It's not well designed or suitable for any serious tasks (the POSIX requirements for it make it fundamentally non-thread-safe, it doesn't allow asynchronous events to be handled by the calling program while the other program is running, etc.), and its use is error-prone (safe construction of the command string is difficult) and non-portable (because the particular form of command strings is implementation-defined, though POSIX defines this for POSIX-conforming implementations).
If C were being designed today, it almost certainly would not include system, and would either leave this type of functionality entirely to the implementation and its library extensions, or would specify something more akin to posix_spawn and related interfaces.
Many interactive applications offer a way for users to execute shell commands. For instance, in vi you can do:
:!ls
and it will execute the ls command. system() is a function they can use to do this, rather than having to write their own fork() and exec() code.
Also, fork() and exec() aren't portable between operating systems; using system() makes code that executes shell commands more portable.
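For illustration, here is a minimal use of system(), including the standard null-pointer probe for a command processor that the Rationale quoted above mentions:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (system(NULL) == 0) {           /* no command processor available */
        fprintf(stderr, "no command processor\n");
        return EXIT_FAILURE;
    }
    int status = system("ls");         /* return value is implementation-defined */
    printf("system() returned %d\n", status);
    return 0;
}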

Are functions such as printf() implemented differently for Linux and Windows

Something I still don't fully understand. For example, standard C functions such as printf() and scanf(), which deal with sending data to the standard output or getting data from the standard input. Will the source code which implements these functions be different depending on whether we are using them for Windows or Linux?
I'm guessing the quick answer would be "yes", but do they really have to be different?
I'm probably wrong, but my guess is that the actual function code would be the same, while the lower-layer functions of the OS that eventually get called by these functions are different. So could any compiler compile these same C functions, with what gets linked afterwards (what these functions depend on to work at lower layers) being what gives us the required behavior?
Will the source code which implements these functions be different depending on if we are using them for Windows or Linux?
Probably. It may even be different on different Linuxes, and for different Windows programs. There are several distinct implementations of the C standard library available for Linux, and maybe even more than one for Windows. Distinct implementations will have different implementation code; otherwise lawyers get involved.
my guess is that the actual function code be the same, but the lower layer functions of the OS that eventually get called by these functions are different. So could any compiler compile these same C functions, but it is what gets linked after (what these functions depend on to work on lower layers) is what gives us the required behavior?
It is conceivable that standard library functions would be written in a way that abstracts the environment dependencies to some lower layer, so that the same source for each of those functions themselves can be used in multiple environments, with some kind of environment-specific compatibility layer underneath. Inasmuch as the GNU C library supports a wide variety of environments, it serves as an example of the general principle, though Windows is not among the environments it supports. Even then, however, the environment distinction would be effective even before the link stage. Different environments have a variety of binary formats.
In practice, however, you are very unlikely to see the situation you describe for Windows and Linux.
Yes, they have different implementations.
Moreover you might be using multiple different implementations on the same OS. For example:
MinGW ships with its own implementation of the standard library, which is different from the one used by MSVC.
There are many different implementations of C library even for Linux: glibc, musl, dietlibc and others.
Obviously, this means there is some code duplication in the community, but there are many good reasons for that:
People have different views on how things should be implemented and tested. This alone is enough to "fork" the project.
License: implementations put some restrictions on how they can be used and might require some actions from the end user (GPL requires you to share your code in some cases). Not everyone can follow those requirements.
People have very different needs. Some environments are multithreaded, some are not. printf might or might not need to use some thread synchronization mechanism. Some people need locale support, some don't. All this can bloat the code in the end, and not everyone is willing to pay for things they do not use. Even strerror is vastly different on different OSes.
The aforementioned synchronization mechanisms are usually OS-specific and work in specific ways. The same can be said about locale handling, signal handling and other things, including the actual data writing and reading.
Some implementations add non-standard extensions that can make your life easier. Not all of those make sense on other OSes. For example, glibc adds an 'e' mode specifier to fopen that opens the file with the O_CLOEXEC flag (see the sketch after this list). This doesn't make sense for Windows.
Many complex things cannot be implemented in pure C and require compiler-specific extensions. This can tie an implementation to a limited number of compilers.
In the end, it is much simpler to have many C libraries, than trying to create a one-size-fits-all implementation.
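As a quick illustration of the glibc 'e' extension mentioned in the list above (glibc-specific, not portable C):

#include <stdio.h>

int main(void)
{
    /* glibc: the 'e' flag opens the underlying descriptor with O_CLOEXEC */
    FILE *f = fopen("/etc/hostname", "re");
    if (f)
        fclose(f);
    return 0;
}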
As you say, the higher-level parts of the implementation of something like printf, like the code used to format the string using the arguments, can be written in a cross-platform way and be shared between Linux and Windows. I'm not sure if there's a C library that actually does this, though.
But to interact with the hardware or use other operating system facilities (such as when printf writes to the console), the libc implementation has to use the OS's interface: the system calls. And these are very different between Windows and Unix-likes, and different even among Unix-likes (POSIX specifies a lot of them, but there are OS-specific extensions). For example, here you can find system call tables for Linux and Windows.
There are two parts to functions like printf(). The first part parses the format string and assembles an array of characters ready for output. If this part is written in C, there's no reason preventing it being common across all C libraries, and no reason preventing it being different, so long as the standard definition of what printf() does is implemented. As it happens, different library developers have read the standard's definition of printf() and have come up with different ways of parsing and acting on the format string. Most of them have done so correctly.
The second part, the bit that outputs those characters to stdout, is where the differences come in. It depends on using the kernel system call interface; it's the kernel / OS that looks after input/output, and that is done in a specific way. The source code required to get the Linux kernel to output characters is very different to that required to get Windows to output characters.
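A rough sketch of that split, with my_printf as a made-up name (a real implementation buffers properly instead of using a fixed array): the formatting is portable C, and only the final write is OS-specific:

#include <stdarg.h>
#include <stdio.h>

#ifdef _WIN32
#include <io.h>
#define WRITE_STDOUT(buf, len) _write(1, (buf), (unsigned)(len))
#else
#include <unistd.h>
#define WRITE_STDOUT(buf, len) write(1, (buf), (size_t)(len))
#endif

int my_printf(const char *fmt, ...)
{
    char buf[1024];
    va_list ap;
    va_start(ap, fmt);
    int len = vsnprintf(buf, sizeof buf, fmt, ap);   /* portable: parse and format */
    va_end(ap);
    if (len > 0)
        WRITE_STDOUT(buf, len);                      /* OS-specific: system call */
    return len;
}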
On Linux, it's usual to use glibc; this does some elaborate things with printf(), buffering the output characters in user space until a newline is output, and only then making the system call that actually writes the characters. This means that printf() calls from separate threads are neatly separated, each ending up on its own line. But the same program source code, compiled against another C library for Linux, won't necessarily do the same thing, and printf() output from different threads can end up all jumbled together and unreadable.
There's also no reason why the library that contains printf() should be written in C. So long as the same function calling convention as used by the C compiler is honoured, you could write it in assembler (though that'd be slightly mad!). Or Ada (calling convention might be a bit tricky...).
Will the source code which implements these functions be different
Let us try another point-of-view: competition.
No. Competitors in industry are not required by the C spec to share source code to issue a compliant compiler - nor would various standard C library developers always want to.
C does not require "open source".

When should Win32/WinAPI types be used vs. Standard C types?

I'm late to the Win32 party and there's a sea of functions such as _tprintf and TEXT(); there are also libraries like strsafe.h which have functions like StringCchCopy(), StringCchLength(), etc. Basically, the Win32 API introduces a bunch of extra functions and types on top of C which can be confusing to a C programmer who hasn't worked with Win32 much. I do not have a problem finding the definitions of these types and functions on MSDN. However, I do have a problem finding guidelines on when and why they should be used.
I have 2 questions:
How important is it to use all of these types and special functions which Microsoft has provided on top of standard C when programming with Win32? Is it considered good practice to do away with all standard C functions and types and use entirely Microsoft wrappers?
Is it okay to mix standard C functions in with these Microsoft types and functions? For example, to use malloc() instead of HeapAlloc(), or printf() instead of _tprintf(), etc.?
I have a copy of Charles Petzold's Programming Windows Fifth Edition book but it mostly covers GUI stuff and not a lot of the remainder of the API.
There are actually 3 questions here, the ones you explicitly asked, and the one you didn't. Let's get that last one out of the way first, as it tends to cause the most confusion:
What are those _t-extensions offered by the Microsoft-provided CRT?
They are generic-text mappings, introduced to make it possible to write code that targets both ANSI-based systems (Win9x) as well as Unicode-based systems (Windows NT). They are macros that expand to the actual function calls, based on the _UNICODE and _MBCS preprocessor symbols. For example, the symbol _tprintf expands to either printf or wprintf.
Likewise, the Windows API provides both ANSI and Unicode versions of the API calls. They, too, are preprocessor macros that expand to the actual API call, depending on the preprocessor symbol UNICODE. For example, the CreateFile symbol expands to CreateFileA or CreateFileW.
Generic-text mappings haven't been useful in the past two decades. Today, simply use the Unicode versions of the CRT and API calls (e.g. wprintf and CreateFileW). You can define _UNICODE and UNICODE for good measure, too, so that you don't accidentally call an ANSI version.
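For instance, the modern approach looks roughly like this (a minimal sketch):

#define UNICODE
#define _UNICODE
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* call the W version directly instead of the generic CreateFile macro */
    HANDLE h = CreateFileW(L"log.txt", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    wprintf(L"created log.txt\n");
    CloseHandle(h);
    return 0;
}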
there are also libraries like strsafe.h which have functions like StringCchCopy(), StringCchLength()
Those are safe variants of the CRT string manipulation calls. They are safer than e.g. strcpy because they take the size of the destination buffer, similar to strncpy. The latter, however, suffers from an awkward design decision that causes the destination buffer not to be zero-terminated when the source won't fit. StringCchCopy will always zero-terminate the destination buffer, and thus provides additional safety over the CRT implementations. (Note: C11 introduces safe variants, e.g. strncpy_s, that will always zero-terminate the destination array, provided the input is valid. They also validate the input, calling the currently installed constraint handler when validation fails, thus providing even stronger safety than the strsafe.h implementations. The bounds-checked implementations are a conditional feature of C11.)
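A small sketch of that difference (demo is a made-up name):

#include <string.h>
#include <windows.h>
#include <strsafe.h>

void demo(const char *src)
{
    char a[8], b[8];

    strncpy(a, src, sizeof a);     /* a is NOT terminated if src has 8+ chars */
    a[sizeof a - 1] = '\0';        /* caller must terminate manually */

    HRESULT hr = StringCchCopyA(b, ARRAYSIZE(b), src);
    /* b is always zero-terminated here, even when hr reports truncation */
    (void)hr;
}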
How important is it to use all of these types and special functions which Microsoft has provided on top of standard C when programming with Win32? Is it considered good practice to do away with all standard C functions and types and use entirely Microsoft wrappers?
It is not important at all. You can use whichever is more suitable in your scenario. If in doubt, writing portable (i.e. Standard C) code is generally preferable. You only ever need to call the Windows API if you want the additional control it offers (e.g. HeapAlloc allows more control over the allocation than malloc does; likewise CreateFile provides more options than fopen).
Is it okay to mix standard C functions in with these Microsoft types and functions? For example, to use malloc() instead of HeapAlloc(), or printf() instead of _tprintf(), etc.?
In general, yes, as long as you match those calls: HeapFree what you HeapAlloc, free what you malloc. You must not mix HeapAlloc and free, for example. In case a Windows API call requires special memory management functions to be used, it is explicitly pointed out in the documentation. For example, if FormatMessage is requested to allocate the buffer to return data, it must be freed using LocalFree. If you do not request the API to allocate a buffer, you can pass in a buffer allocated any way you like (malloc, HeapAlloc, IMalloc::Alloc, etc.).
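For example, the FormatMessage case just described looks roughly like this (print_error is a made-up name):

#include <windows.h>
#include <stdio.h>

void print_error(DWORD err)
{
    LPSTR msg = NULL;
    if (FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM,
                       NULL, err, 0, (LPSTR)&msg, 0, NULL)) {
        printf("error %lu: %s", err, msg);
        LocalFree(msg);            /* not free() and not HeapFree() */
    }
}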
It is possible to create programs on Windows without using any standard C library functions but most programs do and then you might as well use malloc over HeapAlloc. malloc will use HeapAlloc or VirtualAlloc internally but it is probably tuned for better performance/less fragmentation compared to the raw API. It also makes it easier to port to POSIX in the future. You will still be forced to use LocalFree/GlobalFree/HeapFree in some places where the API allocates memory for you.
Handling text needs special consideration and you need to decide if you need Unicode support or not. A stroll down memory lane might shed some light on why things are the way they are.
Back when Windows 95/98 was king you could use the char/CHAR narrow string types with both the C standard functions and the Windows API. There was virtually no Unicode support except for a handful of functions.
On Windows NT4/2000, however, the native string type is WCHAR (UTF-16 LE, but Microsoft just calls it Unicode). If you are using Microsoft Visual C++ then you have access to wide-string versions of the C standard library beyond what the C standard actually requires, to ease coding for this platform. When coding for Windows using the Microsoft toolchain, you can assume that the Windows SDK WCHAR type is the same as the wchar_t type defined by C.
Because the development of 95 and NT4 overlapped, they share the same API, and every function that receives/returns a string has two versions, one with an A suffix ("ANSI") and one with a W suffix. On Windows 95 the W functions are just stubs that return failure.
When you include Windows.h it will create defines like #define CreateProcess CreateProcessW if UNICODE is defined or #define CreateProcess CreateProcessA if not.
Visual C++ does the same thing with the tchar.h header. It uses the _UNICODE define to decide if the TCHAR type and the _t* functions use the char or wchar_t type. This meant that you could create two releases from the same source code, one for Windows 95/98/ME and one with full Unicode support.
This is not that relevant anymore but you still need to make a choice because things will be defined for one or the other.
It is still perfectly valid to do
#define UNICODE
#define _UNICODE
#include <windows.h>
#include <tchar.h>
void foo()
{
    TCHAR buf[100];
    SomeWindowsFunction(buf, 100);
    _tprintf(_T("foo: %s\n"), buf);
}
although you will see many people go straight for WCHAR and wprintf these days.
The StrSafe functions were added to make it easier to write bug-free code; they still have the same A/W duplication.
You cannot mix and match WCHAR with printf; even if you use %ls in the format string, the string will be converted internally, and not all Unicode strings will convert correctly.
If POSIX portability is not a requirement then I suggest that you use the wide function extensions provided by Microsoft when you need a C library function.
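For example (greet is a made-up name), keep WCHAR data in the wide calls end to end:

#include <wchar.h>

void greet(const wchar_t *name)
{
    /* %ls is the wide-string conversion in the wide printf family */
    wprintf(L"hello, %ls\n", name);
}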
Note that different versions of the OS use different definitions of base types and different alignments/padding. Remember the 8086, 386 and now Core i7 (16, 32, and 64 bits).
When structs need to be compatible with earlier versions, they typically use pre-defined integer widths and pad for legacy alignment.
For that reason, the types from the API must be used in API calls.
There also used to be, and may still be, different memory models for process memory and shared memory. For example, the clipboard uses a form of shared memory. It is important to use the memory allocation mechanisms Microsoft advises for such API calls.
For everything non-API I use the standard C functions and types.

String safe functions vs security enhanced CRT

Are the StringCch* functions considered safer than the safe versions of the CRT string functions?
Is StringCchCatW safer than wcscat_s? Or StringCchCopyW vs wcscpy_s?
This is tagged C++, so you should not use either of them, but std::wstring instead.
It really depends on how you are dealing with strings in your application.
With the Windows SDK, many people end up mixing a lot of C-runtime and Windows API calls. Because the Windows API is a C API layered on top of the C runtime, it is difficult for Windows programmers to notice they are doing this, and to see why it is (potentially) wrong.
But, basically, at some point in your application development, you really should choose where you are going to get your basic data types from. Especially in the case of string data, as the localization APIs that affect sorting, collation and so on are quite different.
You can choose to use the C runtime's string support. This involves using strings that are typed as const char* or const wchar_t*. Microsoft's C runtime specifically (it is possible to develop Windows applications with GCC, so this distinction can be important if you want to write code that compiles on multiple compilers) offers a set of functions in <tchar.h>, based on storing text data in _TCHARs and using the _MBCS and _UNICODE macros to switch the functions between single-character, multi-byte-character and wide-character versions.
If you choose to use the C runtime's string abstractions, then using the C runtime's "safe" string routines would make the most sense.
Alternatively, Windows GUI applications frequently choose to use the Windows SDK data types: CHAR, INT, WCHAR, TCHAR, DWORD, etc. Strings would usually be represented by LPCTSTR variables, and in this case, continuing to use the Windows abstractions, StringCch* etc., makes the most sense.
Really, the worst thing you can do in a program is keep bouncing between Windows API calls and C-runtime calls, as they sometimes have different rules for how they deal with edge cases... and this can lead to some very hard-to-spot bugs.

How to walk a directory in C

I am using glib in my application, and I see there are convenience wrappers in glib for C's remove, unlink and rmdir. But these only work on a single file or directory at a time.
As far as I can see, neither the C standard nor glib include any sort of recursive directory walk functionality. Nor do I see any specific way to delete an entire directory tree at once, as with rm -rf.
For what I'm doing I'm not worried about any complications like permissions, symlinks back up the tree (infinite recursion), or anything that would rule out a very naive implementation... so I am not averse to writing my own function for it.
However, I'm curious whether this functionality is already out there somewhere in the standard libraries, gtk or glib (or in some other easily reused C library) and I just haven't stumbled on it. Googling this topic generates a lot of false leads.
Otherwise my plan is to use this type of algorithm:
int is_dir(const char *path);            /* assumed helper */
char **get_entries(const char *path);    /* assumed helper, NULL-terminated */

void dir_walk(const char *path, int (*callback)(const char *))
{
    if (is_dir(path)) {
        char **entries = get_entries(path);
        for (char **e = entries; *e != NULL; e++)
            dir_walk(*e, callback);
    }
    callback(path);   /* a file, or a directory that is now empty, so remove() works */
}

dir_walk("/home/user/trash", remove);
Obviously I would build in some error handling and the like to abort the process as soon as a fatal error is encountered.
Have you looked at <dirent.h>? AFAIK this belongs to the POSIX specification, which should be part of the standard library of most, if not all, C compilers. See e.g. this <dirent.h> reference (Single UNIX Specification Version 2 by the Open Group).
P.S., before someone comments on this: No, this does not offer recursive directory traversal. But then I think this is best implemented by the developer; requirements can differ quite a lot, so a one-size-fits-all recursive traversal function would have to be very powerful. (E.g.: Are symlinks followed? Should recursion depth be limited? etc.)
You can use GFileEnumerator if you want to do it with glib.
Several platforms include ftw and nftw: "(new) file tree walk". Checking the man page on an iMac shows that these are legacy, and new users should prefer fts. Portability may be an issue with either of these choices.
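For example, an rm -rf-style delete with POSIX nftw() might be sketched like this (rm_entry is a made-up name):

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <sys/stat.h>
#include <stdio.h>

static int rm_entry(const char *path, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf)
{
    (void)sb; (void)typeflag; (void)ftwbuf;
    return remove(path);      /* works for files and (by then empty) directories */
}

int main(void)
{
    /* FTW_DEPTH visits a directory's entries before the directory itself */
    if (nftw("/home/user/trash", rm_entry, 16, FTW_DEPTH | FTW_PHYS) != 0)
        perror("nftw");
    return 0;
}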
Standard C libraries are meant to provide primitive functionality. What you are talking about is composite behavior. You can easily implement it using the low level features present in your API of choice -- take a look at this tutorial.
Note that the "convenience wrappers" you mention for remove(), unlink() and rmdir(), assuming you mean the ones declared in <glib/gstdio.h>, are not really "convenience wrappers". What is the convenience in prefixing totally standard functions with a "g_"? (And note that I say this even though I am the one who introduced them in the first place.)
The only reason these wrappers exist is for file name issues on Windows, where these wrappers actually consist of real code; they take file name arguments in Unicode, encoded in UTF-8. The corresponding "unwrapped" Microsoft C library functions take file names in system codepage.
If you aren't specifically writing code intended to be portable to Windows, there is no reason to use the g_remove() etc wrappers.
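For reference, a minimal sketch of using the wrapper (delete_one is a made-up name); the path is UTF-8 on every platform, including Windows:

#include <glib.h>
#include <glib/gstdio.h>

void delete_one(const char *utf8_path)
{
    /* same shape as remove(), but takes a UTF-8 file name on Windows too */
    if (g_remove(utf8_path) != 0)
        g_warning("could not remove %s", utf8_path);
}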
