iconv: conversion from UTF-16BE - c

Do all popular iconv implementations support conversion from UTF-16BE (i.e. UTF-16 with big-endian byte order)? GNU iconv supports this encoding, but what about the other implementations in common use? Specifically, what do mingw and the *BSDs support?
Should I rather do this conversion myself?

If it's a big deal for you, you have an easy way out. Just write an autoconf test for UTF-16BE support, and then make the configuration script fail with an error message if it's not present.
Then you can take your time to sift through the standards, or, just forget about the whole issue.
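If it helps, here is a rough sketch of the kind of test program such an autoconf check (e.g. wrapped in AC_RUN_IFELSE) could compile and run; the encoding names and the exact configure wiring are assumptions you would adapt to your setup:

/* conftest sketch: exits successfully only if the system iconv can
   open a UTF-16BE to UTF-8 conversion descriptor. */
#include <iconv.h>
#include <stdlib.h>

int main(void)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");  /* tocode, fromcode */
    if (cd == (iconv_t)-1)
        return EXIT_FAILURE;                       /* UTF-16BE not supported */
    iconv_close(cd);
    return EXIT_SUCCESS;
}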
Since libiconv is LGPL and supports UTF-16BE (see its website), you can always point users to that. There are some projects that bundle libiconv rather than rely on platform implementations.

Bootstrapping a cross-platform compiler

Suppose you are designing, and writing a compiler for, a new language called Foo, among whose virtues is intended to be that it's particularly good for implementing compilers. A classic approach is to write the first version of the compiler in C, and use that to write the second version in Foo, after which it becomes self-compiling.
This does mean you have to be careful to keep backup copies of the binary (as opposed to most programs where you only have to keep backup copies of the source); once the language has evolved away from the first version, if you lost all copies of the binary, you would have nothing capable of compiling the current version. So be it.
But suppose it is intended to support both Linux and Windows. As long as it is in fact running on both platforms, it can compile itself on each platform, no problem. Supposing however you lost the binary on one platform (or had reason to suspect it had been compromised by an attacker); now there is a problem. And having to safeguard the binary for every supported platform is at least one more failure point than I'm comfortable with.
One solution would be to make it a cross-compiler, such that the binary on either platform can target both platforms.
This is not quite as easy as it sounds - while there is no problem selecting the binary output format, each platform provides the system API in the form of C header files, which normally only exist on their native platform, e.g. there is no guarantee code compiled against the Windows stdio.h will work on Linux even if compiled into Linux binary format.
Perhaps that problem could be solved by downloading the Linux header files onto a Windows box and using the Windows binary to cross-compile a Linux binary.
Are there any caveats with that solution I'm missing?
Another solution might be to maintain a separate minimum bootstrap compiler in Python, that compiles Foo into portable C, accepting only that subset of the language needed by the main Foo compiler and performing minimum error checking and no optimization, the intent being that the bootstrap compiler will thus remain simple enough that maintaining it across subsequent language versions wouldn't cost very much.
Again, are there any caveats with that solution I'm missing?
What methods have people used to solve this problem in the past?
This is a problem for C compilers themselves. It's typically solved by the use of a cross-compiler, exactly as you suggest.
The process of cross-compiling a compiler is no more difficult than cross-compiling any other project: that is to say, it's trickier than you'd like, but by no means impossible.
Of course, you first need the cross-compiler itself. This probably means some major surgery to your build-configuration system, and you'll need some kind of "sysroot" taken from the target (headers, libraries, anything else you'll need to reference in a build).
So, in the end it depends on how your compiler is structured. Either it's easier to re-bootstrap using historical sources, repeating each phase of language compatibility you went through in the first place (you did use source revision control, right?), or it's easier to implement a cross-compiler configuration. I can't tell you which from here.
For many years, the GCC compiler was always written only in standard-compliant C code for exactly this reason: they wanted to be able to bring it up on any OS, given only the native C compiler for that system. Only in 2012 was it decided that C++ is now sufficiently widespread that the compiler itself can be written in it. Even then, they're only permitting themselves a subset of the language. In future, if anybody wants to port GCC to a platform that does not already have C++, they will need to either use a cross-compiler, or first port GCC 4.7 (the last major C-only version) and then move to the latest.
Additionally, the GCC build process does not "trust" the compiler it was built with. When you type "make", it first builds a reduced version of itself, then uses that to build a full version. Finally, it uses the full version to rebuild another full version and compares the two binaries. If the two do not match, it knows that the original compiler was buggy and introduced some bad code, and the build has failed.

should I eliminate TCHAR from Windows code?

I am revising some very old (10 years) C code. The code compiles on Unix/Mac with GCC and cross-compiles for Windows with MinGW. Currently there are TCHAR strings throughout. I'd like to get rid of the TCHAR and use a C++ string instead. Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?
Windows still uses UTF-16 and most likely always will, so you need to use wstring rather than string. The Windows APIs don't offer direct support for UTF-8, largely because Windows supported Unicode before UTF-8 was invented.
It is thus rather painful to write Unicode code that will compile on both Windows and Unix platforms.
Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?
Yes. Unfortunately, Windows does not have native support for UTF-8. If you want proper Unicode support, you need to use the wchar_t version of the Windows API functions, not the char version.
should I eliminate TCHAR from Windows code?
Yes, you should. The reason TCHAR exists is to support both Unicode and non-Unicode versions of Windows. Non-Unicode support may have been a major concern back in 2001 when Windows 98 was still popular, but not today.
And it's highly unlikely that any non-Windows-specific library would have the same kind of char/wchar_t overloading that makes TCHAR usable.
So go ahead and replace all your TCHARs with wchar_ts.
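For illustration, a minimal sketch of calling a W function explicitly with wchar_t instead of going through the TCHAR macros (CreateFileW is just one example of a ...W function; the path is made up):

#include <windows.h>

int main(void)
{
    /* Hypothetical path; L"..." gives a wide (UTF-16) string literal. */
    const wchar_t *path = L"C:\\temp\\example.txt";
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    CloseHandle(h);
    return 0;
}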
The code compiles on Unix/Mac with GCC and cross-compiles for Windows with MinGW.
I've had to write cross-platform C++ code before. (Now my job is writing cross-platform C# code.) Character encoding is rather painful when Windows doesn't support UTF-8 and Un*x doesn't support UTF-16. I ended up using UTF-8 as our main encoding and converting as necessary on Windows.
Yes, writing non-Unicode applications nowadays is shooting yourself in the foot. Just use the wide API everywhere, and you'll not have to cry about it later. You can still use UTF-8 on UNIX and wchar_t on Windows if you don't need (network) communication between platforms (or convert the wchar_t's to UTF-8 with the Win32 API), or go the hard way and use UTF-8 everywhere and convert to wchar_t's only when you call Win32 API functions (that's what I do).
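A minimal sketch of that last approach, assuming Windows and the Win32 MultiByteToWideChar API (the helper name is made up and error handling is kept short):

#include <windows.h>
#include <stdlib.h>

/* Convert a NUL-terminated UTF-8 string to a newly allocated wide (UTF-16)
   string for use with the W versions of the Win32 API. Caller frees it. */
wchar_t *utf8_to_wide(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;                      /* invalid UTF-8 or other failure */
    wchar_t *wide = malloc(len * sizeof *wide);
    if (wide != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
    return wide;
}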
To directly answer your question:
Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?
No, (non-ASCII) UTF-8 is not accepted by the vast majority of Windows API functions. You still have to use the wide APIs.
One could similarly bemoan that other OSes still have no support for wchar_t. So you also have to support UTF-8.
The other answers provide some good advice on how to manage this in a cross-platform codebase, but it sounds as if you already have an implementation supporting different character types. As desirable as ripping that out to simplify the code might sound, don't.
And I predict that someday, although probably not before the year 2020, Windows will add UTF-8 support, simply by adding U versions of all the API functions, alongside A and W, plus the same kind of linker hack. The 8-bit A functions are just a translation layer over the native W (UTF-16) functions. I bet they could generate a U-layer semi-automatically from the A-layer.
Once they've been teased enough, long enough, about their '20th century' Unicode support...
They'll still manage to make it awkward to write, ugly to read and non-portable by default, by using carefully chosen macros and default Visual Studio settings.

Operating system agnostic C library

Is there a C library available for operations such as file operations, getting system information and the like, which is generic, can be used when compiled on different platforms, and behaves in a similar way on each?
Edit: Something like Java or .NET platform abstracting the hardware.
Have you tried the standard library? It should be implemented on any system that has an ISO compliant C runtime.
Yes; the ISO Standard C library. It may not cover all the functionality you want, but that is exactly because it is generic, and as such is also lowest common denominator. It only supports features that can reasonably be expected to exist on most hardware, including embedded systems.
The way to approach this is perhaps to specify the range of target platforms you need to support, and then the application domains (e.g. GUI, networking, multi-threading, image processing, file handling etc.), and then select the individual cross-platform libraries that suit your needs. There is probably no one library to fulfil all your needs, and in some cases no common library at all.
That said, you will always be better served in this respect by embracing C++, where you can use any C library as well as C++ libraries. Not only is the C++ standard library larger, but libraries such as Boost, wxWidgets, and ACE cover a broader domain spectrum too. Another approach is to use a cross-platform language such as Java, which solves the problem by abstracting the hardware to a virtual machine. Similarly, .NET/Mono and C# may provide a solution for a suitably limited set of target platforms.
Added following comment:
Hardware abstraction in a real-machine targeted language (as opposed to a VM language such as Java or CLR-based languages) is provided by the operating system, so what you perhaps need is a common operating system API. POSIX is probably the closest you will get to that, being supported on Linux, Unix, OSX (which is Unix), QNX, VxWorks, BeOS and many others; but, importantly, not Windows. One way of using POSIX on Windows is to use Cygwin. Another is to use a VM to host a POSIX OS such as Linux.
For anything not found in the standard library, GLib is a good first place to look along with other libraries built to interact with it. It offers for example threads, mutexes and IPC that you won't be able to write portably using plain standard libraries, and you can use many more GNU libraries that follow the same conventions up to a full GUI in GTK+. GLib supports the usual popular operating systems.
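As a small, hedged example of the portable threading primitives mentioned above (assumes GLib 2.32 or later; build with pkg-config --cflags --libs glib-2.0):

#include <glib.h>
#include <stdio.h>

static GMutex lock;          /* statically allocated, no explicit init needed */
static int counter = 0;

static gpointer worker(gpointer data)
{
    int i;
    (void)data;
    for (i = 0; i < 1000; i++) {
        g_mutex_lock(&lock);
        counter++;
        g_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    GThread *t1 = g_thread_new("worker-1", worker, NULL);
    GThread *t2 = g_thread_new("worker-2", worker, NULL);
    g_thread_join(t1);
    g_thread_join(t2);
    printf("counter = %d\n", counter);    /* prints 2000 */
    return 0;
}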
If the C standard library is not sufficient for your needs, a minimal hyperportable subset of POSIX might be the target you want to code to. For example, the main non-POSIX operating system Windows still has a number of functions with the same names as POSIX functions which behave reasonably closely - open, read, write, etc.
Documenting what exactly a "hyperportable subset of POSIX" includes, and which non-POSIX operating systems would conform to such a subset, is a moderately difficult task, which would be quite useful in and of itself for the sake of avoiding the plague of "MyCompanyName portable runtime" products which appear again and again every few years and unnecessarily bloat popular software like Firefox and Apache.
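To illustrate the idea, here is a sketch that sticks to such a subset; on Windows the same calls are exposed via <io.h> (with deprecation warnings under MSVC), and the O_BINARY flag papers over the text/binary distinction:

#include <fcntl.h>
#ifdef _WIN32
#include <io.h>
#else
#include <unistd.h>
#endif
#ifndef O_BINARY
#define O_BINARY 0    /* POSIX systems have no text/binary distinction */
#endif

/* Copy a file using only open/read/write/close, which exist (with minor
   differences) both on POSIX systems and on Windows. */
int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    int in, out, n;

    in = open(src, O_RDONLY | O_BINARY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, n) != n) {
            n = -1;
            break;
        }
    }
    close(in);
    close(out);
    return n < 0 ? -1 : 0;
}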
If you need facilities beyond what the Standard C library provides, take a look at the Apache Portable Runtime (APR). You should also review whether POSIX provides the functionality you are after, though that comes with its own bag of worms.
If you want to get into graphics and the like, then you are into a different world - GTK, GLib and Qt spring to mind, though I've not used any of them.

Where can I get started with Unicode-friendly programming in C?

So, I’m working on a plain-C (ANSI 9899:1999) project, and am trying to figure out where to get started re: Unicode, UTF-8, and all that jazz.
Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.
I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.
Any links, manpages, Wikipedia articles, example code, is all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.
A must read before considering anything else, if you’re unfamiliar with Unicode, and what an encoding actually is: http://www.joelonsoftware.com/articles/Unicode.html
The UTF-8 home-page: http://www.utf-8.com/
man 3 iconv (as well as iconv_open and iconvctl)
International Components for Unicode (via Geoff Reedy)
libbasekit, which seems to include light Unicode-handling tools
GLib has some Unicode functions
A basic UTF-8 detector function, by Christoph
International Components for Unicode provides a portable C library for handling Unicode. Here's their elevator pitch for ICU4C:
The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fill in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).
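A small, hedged taste of the ICU4C C API (assumes ICU is installed; link with -licuuc), converting UTF-8 to UTF-16 and counting code points:

#include <stdio.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar buf[64];                         /* UTF-16 code units */
    int32_t len16 = 0;

    u_strFromUTF8(buf, 64, &len16, "na\xC3\xAFve", -1, &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
        return 1;
    }
    printf("UTF-16 units: %d, code points: %d\n",
           (int)len16, (int)u_countChar32(buf, len16));
    return 0;
}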
GLib has some Unicode functions and is a pretty lightweight library. It's not near the same level of functionality that ICU provides, but it might be good enough for some applications. The other features of GLib are good to have for portable C programs too.
GTK+ is built on top of GLib. GLib provides the fundamental algorithmic language constructs commonly duplicated in applications. This library has features such as (not a comprehensive list):
Object and type system
Main loop
Dynamic loading of modules (i.e. plug-ins)
Thread support
Timer support
Memory allocator
Threaded Queues (synchronous and asynchronous)
Lists (singly linked, doubly linked, double ended)
Hash tables
Arrays
Trees (N-ary and binary balanced)
String utilities and charset handling
Lexical scanner and XML parser
Base64 (encoding & decoding)
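For a taste of the string/charset side in plain C, here is a small sketch using GLib's UTF-8 helpers (build with pkg-config --cflags --libs glib-2.0):

#include <glib.h>
#include <stdio.h>

int main(void)
{
    const char *s = "caf\xC3\xA9";                 /* "café" encoded as UTF-8 */
    const char *p;

    if (!g_utf8_validate(s, -1, NULL)) {
        fprintf(stderr, "not valid UTF-8\n");
        return 1;
    }
    printf("code points: %ld\n", (long)g_utf8_strlen(s, -1));
    for (p = s; *p != '\0'; p = g_utf8_next_char(p))
        printf("U+%04X\n", (unsigned)g_utf8_get_char(p));
    return 0;
}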
I think one of the interesting questions is - what should your canonical internal format for strings be? The 2 obvious choices (to me at least) are
a) utf8 in vanilla c-strings
b) utf16 in unsigned short arrays
In previous projects I have always chosen UTF-8. Why? Because it's the path of least resistance in the C world. Everything you are interfacing with (stdio, string.h, etc.) will work fine.
Next comes the question of file format. The problem here is that it's visible to your users (unless you provide the only editor for your language). Here I guess you have to take what they give you and try to guess by peeking (byte order marks help).
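A hypothetical helper for the peeking part, checking the first few bytes of a file for a byte order mark (falling back to UTF-8 when there is none is a reasonable default):

#include <string.h>

/* Returns an encoding name if a BOM is recognized, NULL otherwise.
   The UTF-32 checks must come before UTF-16, since their BOMs overlap. */
const char *guess_encoding_from_bom(const unsigned char *b, size_t n)
{
    if (n >= 3 && memcmp(b, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (n >= 4 && memcmp(b, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (n >= 4 && memcmp(b, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (n >= 2 && memcmp(b, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    if (n >= 2 && memcmp(b, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    return NULL;
}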

C program cross platform differences on Windows and Unix OS

Is there any difference between C that is written on Windows and C that is written on Unix?
I teach C as well as C++, but some of my students have come back saying some of the sample programs do not run for them in Unix. Unix is alien to me. Unfortunately I have no experience with it whatsoever; all I know is how to spell it. If there are any differences then I should be advising our department to invest in systems for Unix, as currently there are no Unix systems in our lab. I do not want my students to feel that they have been denied or kept away from something.
Those kinds of problems usually appear when you don't stick to the bare C standard and make assumptions about the environment that may not be true. These may include reliance on:
nonstandard, platform specific includes (<conio.h>, <windows.h>, <unistd.h>, ...);
undefined behavior (fflush(stdin), as someone else reported, is not required to do anything by the standard - it's actually undefined behavior to invoke fflush on anything but output streams; in general, older compilers were more lenient about violation of some subtle rules such as strict aliasing, so be careful with "clever" pointer tricks);
data type size (the short=16 bit, int=long=32 bit assumption doesn't hold everywhere - 64 bit Linux, for example, has 64 bit long);
in particular, pointer size (void * isn't always 32 bit, and can't always be cast safely to an unsigned long); in general you should be careful with conversions and comparisons that involve pointers, and you should always use the provided types for that kind of task instead of "normal" ints (see in particular size_t, ptrdiff_t, uintptr_t);
data type "inner format" (the standard does not say that floats and doubles are in IEEE 754, although I've never seen platforms doing it differently);
nonstandard functions (__beginthread, MS safe strings functions; on the other side, POSIX/GNU extensions)
compiler extensions (__inline, __declspec, #pragmas, ...) and in general anything that begins with double underscore (or even with a single underscore, in old, nonstandard implementations);
console escape codes (this usually is a problem when you try to run Unix code on Windows);
line ending format: in in-memory strings it's \n everywhere, but when written to a file it's \n on *NIX, \r\n on Windows, and \r on pre-OSX Macs; the conversion is handled automagically by the file streams, so be careful to open files in binary mode when you actually want to write binary data, and leave them in text mode when you want to write text (see the small example below).
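A minimal illustration of that last point; the file names are made up:

#include <stdio.h>

int main(void)
{
    FILE *bin = fopen("data.bin", "wb");   /* binary: bytes written verbatim */
    FILE *txt = fopen("log.txt", "w");     /* text: \n may become \r\n on Windows */
    if (bin == NULL || txt == NULL)
        return 1;
    fputc('\n', bin);                      /* exactly one byte, 0x0A */
    fputc('\n', txt);                      /* one or two bytes, OS-dependent */
    fclose(bin);
    fclose(txt);
    return 0;
}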
Anyhow, an example of a program that does not compile on *NIX would be helpful; then we could give you more precise suggestions.
I have yet to get the details of the program. The students were from our previous batch; I have asked them for it. Turbo C is what is being used currently.
As said in the comment, please drop Turbo C and (if you use it) Turbo C++; nowadays they are both pieces of history and have many incompatibilities with the current C and C++ standards (and if I remember correctly they both generate 16-bit executables, which won't even run on 64-bit OSes on x86_64).
There are a lot of free, working and standard-compliant alternatives (VC++ Express, MinGW, Pelles C, CygWin on Windows, and gcc/g++ is the de-facto standard on Linux, rivaled by clang), you just have to pick one.
The language is the same, but the libraries used to get anything platform-specific done are different. But if you are teaching C (and not systems programming) you should easily be able to write portable code. The fact that you are not doing so makes me wonder about the quality of your training materials.
The standard libraries that ship with MSVC and those that ship with a typical Linux or Unix compiler are different enough that you are likely to encounter compatibility issues. There may also be minor dialect variations between MSVC and GCC.
The simplest way to test your examples in a unix-like environment would be to install Cygwin or MSYS on your existing Windows kit. These are based on GCC and common open-source libraries and will behave much more like the C compiler environment on a unix or linux system.
Cygwin is the most 'unix like', and is based on a cygwin.dll, which is an emulation layer that emulates unix system calls on top of the native Win32 API. Generally anything that would compile on Cygwin is very likely to compile on Linux, as Cygwin is based on gcc and glibc. However, native Win32 APIs are not available to applications compiled on Cygwin.
MSYS/MinGW32 is designed for producing native Win32 apps using GCC. However, most of the standard GNU and other OSS libraries are available, so it behaves more like a unix environment than VC does. In fact, if you are working with code that doesn't use Win32 or unix specific APIs it will probably port between MinGW32 and Linux more easily than it would between MinGW32 and MSVC.
While getting Linux installed in your lab is probably a useful thing to do (Use VMWare player or some other hypervisor if you can't get funding for new servers) you can use either of the above toolchains to get something that will probably be 'close enough' for your purposes. You can learn unix as takes your fancy, and both Cygwin and MSYS will give you a unix-like environment that could give you a bit of a gentle intro in the meantime.
C syntax must be the same if both Windows and Unix compilers adhere to the same C standard. I was told that MS compilers still don't support C99 in full, although Unix compilers are up to speed, so it seems C89 is a lowest common denominator.
However in Unix world you typically will use POSIX syscalls to do system stuff, like IPC etc. Windows isn't POSIX system so it has different API for it.
There is this thing called ANSI C. As long as you code purely in ANSI C, there should be no difference. However, this is a rather academic assumption.
In real life, I have never encountered any of my code being portable from Linux to Windows and vice versa without any modification. Actually, these modifications (definitely plural) turned into a vast amount of preprocessor directives, such as #ifdef WINDOWS ... #endif and #ifdef UNIX ... #endif, and even more if some parallel libs, such as OpenMPI, were used.
As you may imagine, this is totally contrary to readable and debuggable code, but that was what worked ;-)
Besides, you have got to consider things already mentioned: UTF-8 will sometimes knock out Linux compilers...
There should be no difference in the C programming language between Windows and *nix, because the language is specified by the ISO standard.
The C language itself is portable from Windows to Unix. But operating system details are different and sometimes those intrude into your code.
For instance, Unix systems typically use only "\n" to separate lines in a text file, while most Windows tools expect to see "\r\n". There are ways to deal with this sort of difference that get the C runtime to handle it for you, but if you aren't careful, or don't know about them, it's pretty easy to write OS-specific C code.
I would suggest that you run Unix in a virtual machine and use that to test your code before you share it with your students.
I think it's critical that you familiarize yourself with Unix right now.
An excellent way to do this is a with a Knoppix CD.
Try to compile your programs under Linux using gcc, and when they don't work, track down the problems (#include <windows.h>?) and make them work. Then return to Windows, and they'll likely compile OK.
In this way, you will discover your programs become cleaner and better teaching material, even for lab exercises on windows machines.
A common problem is that fflush(stdin) doesn't work on Unix.
Which is perfectly normal, since the standard doesn't define how the implementation should handle it.
The solution is to use something like this (untested):
int c;
do
{
    c = getchar();    /* read and discard characters up to the newline */
}
while (c != '\n' && c != EOF);
Similarly, you need to avoid anything that causes undefined behavior.

Resources