Non-ASCII filename for fopen() - C

I need a robust cross-platform solution for reading a specific binary file in C. Let's say I want to fopen() such a (possibly big) file, allocate a temporary buffer, and then fread() a sequence of bytes to update my SHA1_CTX, and finally close my FILE, finalize the SHA-1, and move on. Quite trivial, right?
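In pseudo-C it is something like this (SHA1Init/SHA1Update/SHA1Final and the sha1.h header are only placeholders for whatever SHA-1 implementation I end up using):

    #include <stdio.h>
    #include "sha1.h"   /* hypothetical header for SHA1_CTX, SHA1Init, SHA1Update, SHA1Final */

    /* Hash one file in fixed-size chunks; returns 0 on success, -1 on error. */
    static int sha1_file(const char *path, unsigned char digest[20])
    {
        FILE *f = fopen(path, "rb");   /* <-- but what if path is not ASCII? */
        if (!f)
            return -1;

        SHA1_CTX ctx;
        SHA1Init(&ctx);

        unsigned char buf[64 * 1024];  /* temporary buffer */
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            SHA1Update(&ctx, buf, n);

        int err = ferror(f);
        fclose(f);
        if (err)
            return -1;

        SHA1Final(digest, &ctx);
        return 0;
    }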
But there is one thing I'm not sure about: what if the filename is not ASCII?
Let's say I will have:
/Users/me/Projects/my_file.bin
/home/me/файлы/работа/мой_файл.bin
C:\\我的檔案\\我的工作.bin
D:\\Folder🙈\\🙂👍😘.bin
Can fopen handle such paths? If not, what can I do? I could write some platform-specific code or look for a cross-platform library, but it is extremely important for my application to be as small as possible; moreover, it is written in C, so Qt, Boost, etc. are not applicable.
Thanks.

On essentially every platform except Windows, the expectation is that you pass filenames to the standard functions as ordinary char[] strings in the character encoding of the locale in use, and on all modern systems that encoding will be UTF-8. You can either:
honor this by calling setlocale(LC_ALL, "") (or setlocale(LC_CTYPE, "") if you don't want the other locale features) and treating all local text input and output as being in whatever that encoding is (this makes users happy, but can cause trouble when some external input, e.g. from the network, is UTF-8 that is not representable in the locale's encoding), or
just always work in UTF-8 and hope that passing UTF-8 strings through to the filesystem-access functions works, by virtue of their treating paths as opaque byte arrays.
Unfortunately none of this works on Windows today, although things have been moving in that direction. It does work if you build your application with Cygwin or midipix. Short of that, you need shims to make things work on Windows, and it's a huge pain.
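The usual shim looks roughly like this; it is only a sketch, and fopen_utf8 is my own name for it. It assumes you keep paths as UTF-8 internally, converts them to UTF-16 with MultiByteToWideChar, and hands them to _wfopen on Windows, falling back to plain fopen elsewhere:

    #include <stdio.h>

    #ifdef _WIN32
    #include <windows.h>
    #include <stdlib.h>

    /* Convert a UTF-8 path (and mode) to UTF-16 and open it with _wfopen. */
    static FILE *fopen_utf8(const char *path, const char *mode)
    {
        wchar_t wmode[8];
        int np = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
        int nm = MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 8);
        if (np <= 0 || nm <= 0)
            return NULL;
        wchar_t *wpath = malloc((size_t)np * sizeof *wpath);
        if (!wpath)
            return NULL;
        MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, np);
        FILE *f = _wfopen(wpath, wmode);
        free(wpath);
        return f;
    }
    #else
    /* On POSIX systems the byte string is passed through unchanged. */
    static FILE *fopen_utf8(const char *path, const char *mode)
    {
        return fopen(path, mode);
    }
    #endif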

It is operating system specific and file system specific.
You might not know what encoding is used for the file path. The user of your program should know that.
However, in 2018, UTF-8 tends to be used almost everywhere. In practice, that is still not always the case (especially on Windows).
BTW, different OSes have different restrictions on the file path. On Linux, in principle, you could have a file name containing only a tab and a return character (of course that is very poor taste, and nobody does that in practice; for details read path_resolution(7)). On Windows, that is not allowed.
Can fopen handle such paths?
Yes. The C11 standard (read n1570 for details) does not speak of character encoding.
A different question is what your particular implementation does with such paths. The devil is in the details, and they can be ugly.

Related

What does character encoding in C programming language depend on?

What does character encoding in the C programming language depend on? (The OS? The compiler? The editor?)
I'm working not only with ASCII characters but also with characters from other encodings such as UTF-8.
How can we check the current character encoding in C?
C source code might be stored in various encodings. Which ones are accepted is clearly compiler dependent (i.e. a compiler setting, if available). Still, I wouldn't count on that; I'd stick to ASCII-only source. (IMHO this is the most portable way to write code.)
Actually, you can encode any character of any encoding using only ASCII in C source code if you write it as an octal or hex escape sequence. (This is what I do from time to time to earn the respect of my colleagues – writing German text with \303\244, \303\266, \303\274, \303\237 in translation tables from memory...)
Example: "\303\274" encodes the UTF-8 sequence for the string constant "ü". (But if I print this on my Windows console I only get "��", although I set code page 65001, which should provide UTF-8. The damn Windows console...)
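For illustration, a tiny made-up translation table written this way (the words and the table are invented; the point is that the source file stays pure ASCII while the strings are UTF-8):

    #include <stdio.h>

    /* "Tür" and "Schlüssel" spelled with UTF-8 octal escapes: ü = \303\274 */
    static const char *const translations[][2] = {
        { "door", "T\303\274r"       },
        { "key",  "Schl\303\274ssel" },
    };

    int main(void)
    {
        for (size_t i = 0; i < sizeof translations / sizeof translations[0]; ++i)
            printf("%s -> %s\n", translations[i][0], translations[i][1]);
        return 0;
    }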
A program written in C may handle any encoding you are able to deal with. Actually, characters are just numbers that can be stored in one of the available integral types (e.g. char for ASCII and UTF-8, wider integer types for encodings with 16- or 32-bit characters). As Clifford already mentioned, the output side decides what to do with these numbers. Thus, this is platform dependent.
To handle characters according to a certain encoding (e.g. converting to upper or lower case, locale-aware dictionary sorting, etc.), you have to use an appropriate library. This might be part of the standard library, the system libraries, or a 3rd-party library.
This is especially true for conversion from one encoding to another. This is a good point to mention libintl.
I personally prefer ASCII, Unicode, and UTF-8 (and, unfortunately, UTF-16, as I do most of my work on Windows 10). In this special case, the conversion can be done by a pure bit-fiddling algorithm (without any knowledge of specific characters). You may have a look at the Wikipedia article on UTF-8 to get an idea. With a web search, you will probably find something ready to use if you don't want to do it yourself.
The C++11 and C++14 standard libraries also provide support (e.g. std::codecvt_utf8), but it is marked as deprecated in C++17. Thus, I don't need to throw away my bit-fiddling code (which I'm so proud of). Oops. This question is tagged c – sorry.
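To show what such bit-fiddling looks like (staying in C), here is a sketch that encodes a single Unicode code point as UTF-8; treat it as an outline rather than production code:

    #include <stddef.h>

    /* Encode one Unicode code point (U+0000..U+10FFFF) as UTF-8.
       Writes up to 4 bytes into buf and returns the byte count,
       or 0 if the code point is out of range or a surrogate. */
    static size_t utf8_encode(unsigned long cp, unsigned char buf[4])
    {
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            if (cp >= 0xD800 && cp <= 0xDFFF)   /* UTF-16 surrogates are invalid */
                return 0;
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp <= 0x10FFFF) {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;
    }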
It is platform or display device/framework dependent. The compiler does not care how the platform interprets either char or wchar_t when such values are rendered as glyphs on some display device.
If the output goes to a remote terminal, the rendering depends on that terminal rather than on the execution environment; on a desktop computer, the output may go to a text console or to a GUI, and the resulting rendering may differ even between those.

Do modern terminals generally render all UTF-8 characters correctly?

I am writing an application in C that will be run in a terminal, and it would be handy, but not necessary, to use some of the less common Unicode characters. From my experimentation, I have not had any trouble rendering them. However, I would not use any non-ASCII characters if they were a likely source of trouble in the future.
So, in short, can I count on just about any terminal or terminal emulator in the modern *nix world (mainly Linux, FreeBSD, and OS X) to properly render arbitrary UTF-8 characters?
If I cannot make such an assumption, there are particular subsets of Unicode characters defined for various purposes, so would some such subset at least be reliably rendered by any likely modern *nix terminal or terminal emulator?
NOTE: When I say arbitrary, I do mean arbitrary: any Unicode characters. But for completeness, I will note that I am primarily interested in arrows and mathematical characters; this link has lists of both: https://en.wikipedia.org/wiki/Unicode_symbols.
No, you should not assume that. Even in a modern system, the set of fonts installed, the font used by the terminal application, and environment variables such as LANG, LC_*, etc. may influence whether certain characters can be displayed correctly on the terminal or not.
You might be able to make a reasonable guess based on the values of the TERM, LANG, and LC_* environment variables as to what is supported, but it's still going to be a guess. I'd suggest either not relying on it at all or providing some means of enabling/disabling their use (via an environment variable and/or a command-line flag for the application).
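For example, on POSIX systems a crude version of that guess can ask the locale for its codeset via nl_langinfo; this is only a heuristic and says nothing about the installed fonts, and locale_is_utf8 is my own helper name:

    #include <langinfo.h>  /* nl_langinfo (POSIX, not available on Windows) */
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    /* Heuristic: if the locale's codeset is UTF-8, UTF-8 output is at least plausible. */
    static int locale_is_utf8(void)
    {
        setlocale(LC_CTYPE, "");              /* honour the user's locale */
        const char *cs = nl_langinfo(CODESET);
        return cs && (strcmp(cs, "UTF-8") == 0 || strcmp(cs, "utf8") == 0);
    }

    int main(void)
    {
        if (locale_is_utf8())
            puts("\342\206\222 looks safe to try");  /* U+2192 RIGHTWARDS ARROW */
        else
            puts("-> falling back to ASCII");
        return 0;
    }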
For the most part, this depends on the font, not the terminal. But there are a couple of things the terminal software has to take into account. For example, halfwidth and fullwidth forms of CJK characters.
Also, Unicode characters are added on a regular basis. There's no way that every font and terminal software is automatically updated as soon as a new version of the Unicode standard is released.
In general, you should assume that there are always Unicode characters that are not rendered correctly, even on a modern terminal.

How feasible is it to virtualise the FILE* interfaces of C?

I have often noticed that I would have been able to solve practical problems in C elegantly if there had been a way of creating a 'virtual FILE' and attaching the necessary callbacks for events such as buffer full, input requested, close, and flush. It should then be possible to use a large part of the stdio.h functions, e.g. fprintf, unchanged. Is there a framework enabling one to do this? If not, is it feasible with a moderate amount of effort, on at least some platforms?
Possible applications would be:
To write to or read from a dynamic or static region of memory.
To write to multiple files in parallel.
To read from a thread or co-routine generating data.
To apply a filter to another (virtual or real) FILE.
Support for file formats with indirection (like #include).
A C pre-processor(?).
I am less interested in solutions for specific cases than in a framework to let you roll your own FILE. I am also not looking for a virtual filesystem, but rather virtual FILE*s that I can pass to the CRT.
To my disappointment I have never seen anything of the sort; as far as I can see C11 considers FILE entirely up to the language implementer, which is perhaps reasonable if one wishes to keep the language (+library) specifications small but sad if you compare it with Java I/O streams.
I feel sure that virtual FILEs must be possible with any (fully) open-source implementation of the C run-time, but I imagine there might be a large number of details making it trickier than it seems, and if it has already been done it would be a shame to duplicate the effort. It would also be greatly preferable not to have to modify the CRT code. Without open source, one might be able to reverse-engineer the functions supplied, but I fear the result would be far too vulnerable to changes in unsupported features, unless there were a commitment to a set of interfaces. I suppose, too, that any system for which one can write a device driver would allow one to create a virtual device, but I suspect that of being unnecessarily low-level and of requiring one to write privileged code.
I have to admit that while I have code that would have benefited from virtual FILEs, I have no current requirement for it; nonetheless it is something I have often wondered about and that I imagine could be of interest to others.
This is somewhat similar to a-reader-interface-that-consumes-files-and-char-in-c, but there the questioner did not hope to return a virtual FILE; the answer, however, using fmemopen, did.
There is no standard C interface for creating virtual FILE*s, but both the GNU and the BSD standard libraries include one. On Linux (glibc) you can use fopencookie; on most *BSD systems (including Mac OS X), funopen. (See note 1.)
The two interfaces are similar but slightly different in some details. However, it is usually very simple to adapt code written for one interface to the other.
These are not complete virtualizations. They associate the FILE* with four callbacks and a void* context (the "cookie" in fopencookie). The callbacks are read, write, seek and close; there are no callbacks for flush or tell operations. Still, this is sufficient for many simple FILE* adaptors.
For a simple example, see the two answers to Write simultaneously to two streams.
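To give a flavour of the interface, here is a minimal glibc-only sketch in the same spirit: a FILE* that duplicates everything written to it into two underlying streams (a funopen version for the BSDs would look almost identical):

    #define _GNU_SOURCE            /* fopencookie is a GNU extension */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    struct tee { FILE *a, *b; };

    static ssize_t tee_write(void *cookie, const char *buf, size_t size)
    {
        struct tee *t = cookie;
        fwrite(buf, 1, size, t->a);
        fwrite(buf, 1, size, t->b);
        return (ssize_t)size;      /* pretend the whole write succeeded */
    }

    static int tee_close(void *cookie)
    {
        struct tee *t = cookie;
        fflush(t->a);
        fflush(t->b);
        free(t);
        return 0;
    }

    /* Return a FILE* that writes to both a and b; NULL on failure. */
    static FILE *tee_open(FILE *a, FILE *b)
    {
        struct tee *t = malloc(sizeof *t);
        if (!t)
            return NULL;
        t->a = a;
        t->b = b;
        cookie_io_functions_t io = { .write = tee_write, .close = tee_close };
        return fopencookie(t, "w", io);
    }

    int main(void)
    {
        FILE *log = fopen("demo.log", "w");
        if (!log)
            return 1;
        FILE *out = tee_open(stdout, log);
        if (!out) {
            fclose(log);
            return 1;
        }
        fprintf(out, "Hello from a virtual FILE*\n");
        fclose(out);   /* flushes the cookie stream and frees the cookie */
        fclose(log);
        return 0;
    }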
Notes:
funopen is derived from "functional open", not from "file unopen".

Why do file formats have magic numbers?

For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.
What purpose do such magic values serve?
Because users change file extensions, and other programs "steal" file extensions, a magic number allows an application to cancel processing of a file in an unknown format instead of trying its best and then failing anyway.
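For instance, a reader can bail out early with a check along these lines (using the "Rar!" signature mentioned in the question; has_rar_magic is my own helper name):

    #include <stdio.h>
    #include <string.h>

    /* Return 1 if the file starts with the 4-byte "Rar!" signature. */
    static int has_rar_magic(const char *path)
    {
        unsigned char buf[4];
        FILE *f = fopen(path, "rb");
        if (!f)
            return 0;
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);
        return n == sizeof buf && memcmp(buf, "Rar!", 4) == 0;
    }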
The concept of magic numbers goes back to Unix and pre-dates the use of file extensions.
The original idea of the shell was that all 'executables' would look the same - it didn't matter how the file had been created or what program should be used to evaluate it. The shell would look at the contents of the file and determine how to run it. Microsoft came along and chose a different approach, and the era of file extensions was born. Then, to make things 'nicer' for users, Microsoft chose to 'hide' these extensions, and so began the era of trojan files that look like one type but really have a different extension and are processed by a different program.
If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).
Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.
BTW, trying to guess the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to guess it fairly well (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy), but if all dates fall within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risk of misinterpretation would be reduced.
To quickly identify the type of the file, or the positions within it.
Your question should not be "why do file formats have magic numbers", but rather "what are the advantages of file formats having magic numbers"!
Suggestions:
Programs that undelete files by reading disk free space may recognize file types
Your UNIX knows whether an executable file is to be interpreted (shebang) or is a binary
When you lose extensions, programs like file can detect what your files are
Designers of file formats consider it safer when applications can easily check that they are reading a file in the expected format.
Since the format has a header anyway, it does not cost much to put a magic number at its start.

How do I check if a file is text-based?

I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like a BOM, the byte-order mark)
user direction (just edit the file, goshdarnit)
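For example, a BOM check only needs the first few bytes of the stream; something like this sketch (detect_bom is my own name, and a missing BOM proves nothing, since most UTF-8 files have none):

    #include <stddef.h>
    #include <string.h>

    /* Look at the first bytes of a buffer and report a BOM, if present. */
    static const char *detect_bom(const unsigned char *p, size_t n)
    {
        if (n >= 3 && memcmp(p, "\xEF\xBB\xBF", 3) == 0) return "UTF-8";
        if (n >= 2 && memcmp(p, "\xFF\xFE", 2) == 0)     return "UTF-16LE";
        if (n >= 2 && memcmp(p, "\xFE\xFF", 2) == 0)     return "UTF-16BE";
        return NULL;  /* no BOM detected */
    }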
Windows used to provide an API, IsTextUnicode, that would do a probabilistic examination, but there were well-known false positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for goahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check that it contains only alphanumeric characters.
So basically the first thing you have to do is check the file encoding. If it's pure ASCII, you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions like isalpha and isdigit. Of course, you have to take care of special exceptions like tabs '\t', spaces ' ', and newlines ('\n' on Linux, "\r\n" on Windows).
With a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric. Also note that with UTF-16 or wider encodings a simple char array is too small per character... but if you are doing it in, say, C#, you don't have to worry about the size :)
You can write a function that will try to determine whether a file is text-based. While this will not be 100% accurate, it may be good enough for you. Such a function does not need to go through the whole file; about a kilobyte should be enough (or even less). One thing to do is count how many whitespace characters and newlines there are. Another is to look at individual bytes and check whether they are alphanumeric or not. With some experimentation you should be able to come up with a decent function. Note that this is just a basic approach and that text encodings might complicate things.
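Putting that together, a sketch of such a function might look like this (the 1 KiB sample size and the 5% threshold are arbitrary choices, not magic values):

    #include <ctype.h>
    #include <stdio.h>

    /* Sample the first 1 KiB: if it contains NUL bytes or too many
       non-printable, non-whitespace bytes, call it binary. */
    static int is_probably_text(const char *path)
    {
        unsigned char buf[1024];
        FILE *f = fopen(path, "rb");
        if (!f)
            return 0;
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);
        if (n == 0)
            return 1;                    /* treat an empty file as text */

        size_t suspicious = 0;
        for (size_t i = 0; i < n; ++i) {
            unsigned char c = buf[i];
            if (c == 0)
                return 0;                /* NUL almost never appears in text */
            if (!isprint(c) && !isspace(c) && c < 0x80)
                ++suspicious;            /* bytes >= 0x80 may be UTF-8 */
        }
        return suspicious * 20 < n;      /* fewer than 5% suspicious bytes */
    }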
