should I eliminate TCHAR from Windows code?

I am revising some very old (10 years) C code. The code compiles on Unix/Mac with GCC and cross-compiles for Windows with MinGW. Currently there are TCHAR strings throughout. I'd like to get rid of the TCHAR and use a C++ string instead. Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?

Windows still uses UTF-16 and most likely always will, so on Windows you need wstring rather than string. The Windows APIs don't support UTF-8 directly, largely because Windows adopted Unicode before UTF-8 was invented.
It is therefore rather painful to write Unicode code that compiles on both Windows and Unix platforms.

Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?
Yes. Unfortunately, Windows does not have native support for UTF-8. If you want proper Unicode support, you need to use the wchar_t version of the Windows API functions, not the char version.
should I eliminate TCHAR from Windows code?
Yes, you should. The reason TCHAR exists is to support both Unicode and non-Unicode versions of Windows. Non-Unicode support may have been a major concern back in 2001 when Windows 98 was still popular, but not today.
And it's highly unlikely that any non-Windows-specific library would have the same kind of char/wchar_t overloading that makes TCHAR usable.
So go ahead and replace all your TCHARs with wchar_ts.
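For concreteness, here is a minimal before/after sketch of that replacement (the buffer and message strings are made up for illustration):

/* Before: TCHAR-based code whose meaning depends on the UNICODE macro:
       TCHAR name[32];
       _tcscpy(name, _T("example"));
       MessageBox(NULL, name, _T("Title"), MB_OK);
   After: explicit wide characters, independent of the UNICODE macro. */
#include <windows.h>
#include <wchar.h>

void show_example(void)
{
    wchar_t name[32];
    wcscpy(name, L"example");
    MessageBoxW(NULL, name, L"Title", MB_OK);
}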
The code compiles on Unix/Mac with GCC and cross-compiles for Windows with MinGW.
I've had to write cross-platform C++ code before. (Now my job is writing cross-platform C# code.) Character encoding is rather painful when Windows doesn't support UTF-8 and Un*x doesn't support UTF-16. I ended up using UTF-8 as our main encoding and converting as necessary on Windows.

Yes, writing non-Unicode applications nowadays is shooting yourself in the foot. Just use the wide API everywhere, and you won't have to cry about it later. You can still use UTF-8 on Unix and wchar_t on Windows if you don't need (network) communication between platforms (or convert the wchar_t strings to UTF-8 with the Win32 API), or go the hard way and use UTF-8 everywhere, converting to wchar_t only when you call Win32 API functions (that's what I do).
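At the Win32 boundary, that conversion is a couple of calls to MultiByteToWideChar and WideCharToMultiByte with CP_UTF8. A minimal sketch (the helper names are mine, and error handling is kept to the bare minimum):

#include <windows.h>
#include <stdlib.h>

/* UTF-8 -> UTF-16, for passing strings into the W API; caller frees the result. */
wchar_t *utf8_to_utf16(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    wchar_t *utf16 = (len > 0) ? malloc(len * sizeof(wchar_t)) : NULL;
    if (utf16 != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, len);
    return utf16;
}

/* UTF-16 -> UTF-8, for results coming back from the W API. */
char *utf16_to_utf8(const wchar_t *utf16)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16, -1, NULL, 0, NULL, NULL);
    char *utf8 = (len > 0) ? malloc(len) : NULL;
    if (utf8 != NULL)
        WideCharToMultiByte(CP_UTF8, 0, utf16, -1, utf8, len, NULL, NULL);
    return utf8;
}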

To directly answer your question:
Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?
No, (non-ASCII) UTF-8 is not accepted by the vast majority of Windows API functions. You still have to use the wide APIs.
One could similarly bemoan that the native APIs on other OSes still take narrow char strings rather than wchar_t. So you also have to support UTF-8.
The other answers provide some good advice on how to manage this in a cross-platform codebase, but it sounds as if you already have an implementation supporting different character types. As desirable as ripping that out to simplify the code might sound, don't.

And I predict that someday, although probably not before the year 2020, Windows will add UTF-8 support, simply by adding U versions of all the API functions, alongside A and W, plus the same kind of linker hack. The 8-bit A functions are just a translation layer over the native W (UTF-16) functions. I bet they could generate a U-layer semi-automatically from the A-layer.
Once they've been teased enough, long enough, about their '20th century' Unicode support...
They'll still manage to make it awkward to write, ugly to read and non-portable by default, by using carefully chosen macros and default Visual Studio settings.

Related

How to get extended locale information with Windows CRT API

I am working on a personal project in which I need to obtain full locale formatting information from a C locale.
I cannot simply use localeconv or localeconv_l, since lconv does not provide all the formatting information needed. On *NIX this is solved by the nl_langinfo and nl_langinfo_l functions, but they are not present on Windows.
What ways are there to obtain locale formatting information on Windows?
Start with GetUserDefaultUILanguage.
Similar and related APIs include (see the sketch after this list):
GetUserDefaultLocaleName
GetUserDefaultLCID
GetUserDefaultLangID
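A minimal sketch of querying formatting data through these APIs (Vista or later; it fetches only the decimal separator, and error handling is minimal):

#define _WIN32_WINNT 0x0600   /* Vista+, for GetUserDefaultLocaleName/GetLocaleInfoEx */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    wchar_t locale[LOCALE_NAME_MAX_LENGTH];
    wchar_t decimal_sep[8];

    if (GetUserDefaultLocaleName(locale, LOCALE_NAME_MAX_LENGTH) == 0)
        return 1;
    if (GetLocaleInfoEx(locale, LOCALE_SDECIMAL, decimal_sep,
                        sizeof decimal_sep / sizeof decimal_sep[0]) == 0)
        return 1;

    wprintf(L"locale=%ls, decimal separator=%ls\n", locale, decimal_sep);
    return 0;
}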
The Perl 5 open-source C code contains an emulation of nl_langinfo() for Windows and other platforms that lack it. You can steal that code, though it is complicated by trying to work on many different platforms with many different configurations.
A few fields aren't implemented, such as the Japanese emperor era names, but anything in common use is available.
Start with this file: https://github.com/Perl/perl5/blob/blead/locale.c
The code continues to evolve.

How to test my application is UNICODE Compatible or not?

I apologise if it is a silly question. I recently developed an application on Windows with C and the Win32 API, and I need to check whether the application is Unicode compatible or not. How can I test this? Is there a standard procedure for checking Unicode compatibility? Moreover, I don't have a Chinese-language machine, or one in any other language; I want to do this test on my own machine, whose default language is English.
Please provide some links if possible, or a detailed procedure.
Great question. On Windows this is genuinely challenging, because many different encodings and code pages are supported and they can be mixed.
What I usually do is test the application on input that mixes two non-ASCII scripts, such as a filename containing both Russian and Hebrew letters, and check that the application can open the file, and so on. You can copy this: "שלום привет hello" and see how the application handles that kind of input.
Because there are two scripts here, no single ANSI code page can represent the string, which rules out the most common class of bug.
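If you want something scriptable, here is a hypothetical sketch of creating such a test file from C (Windows-specific, uses _wfopen; the source file must be compiled with a Unicode-aware source encoding such as UTF-8):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* A filename mixing Hebrew, Cyrillic and Latin letters: no single
       ANSI code page can represent it. */
    const wchar_t *name = L"שלום привет hello.txt";
    FILE *f = _wfopen(name, L"w");
    if (f == NULL) {
        fputs("could not create test file\n", stderr);
        return 1;
    }
    fputws(L"test data\n", f);
    fclose(f);
    return 0;
}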

iconv: conversion from UTF-16BE

Do all popular iconv implementations support conversion from UTF-16BE (i.e. UTF-16 with big-endian byte order)? GNU iconv supports this encoding, but what about the other implementations in common use? Specifically, what do mingw and the *BSDs support?
Should I rather do this conversion myself?
If it's a big deal for you, you have an easy way out: write an autoconf test for UTF-16BE support, and make the configure script fail with an error message if it's not present.
Then you can take your time sifting through the standards, or just forget about the whole issue.
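A sketch of such a probe, assuming a configure-style check that compiles and runs a tiny program:

#include <iconv.h>
#include <stdio.h>

/* Exit non-zero if this iconv cannot convert UTF-16BE to UTF-8, so a
   configure script can fail with a clear message. */
int main(void)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");
    if (cd == (iconv_t)-1) {
        fprintf(stderr, "iconv: UTF-16BE -> UTF-8 not supported\n");
        return 1;
    }
    iconv_close(cd);
    return 0;
}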
Since libiconv is LGPL and supports UTF-16BE (website), you can always point users to that. There are some projects that include libiconv rather than rely on platform implementations.

Where can I get started with Unicode-friendly programming in C?

So, I’m working on a plain-C (ANSI 9899:1999) project, and am trying to figure out where to get started re: Unicode, UTF-8, and all that jazz.
Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.
I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.
Any links, manpages, Wikipedia articles, example code, is all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.
A must read before considering anything else, if you’re unfamiliar with Unicode, and what an encoding actually is: http://www.joelonsoftware.com/articles/Unicode.html
The UTF-8 home-page: http://www.utf-8.com/
man 3 iconv (as well as iconv_open and iconvctl)
International Components for Unicode (via Geoff Reedy)
libbasekit, which seems to include light Unicode-handling tools
Glib has some Unicode functions
A basic UTF-8 detector function, by Christoph
International Components for Unicode provides a portable C library for handling Unicode. Here's their elevator pitch for ICU4C:
The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code cannot make use of them. The ICU4C libraries fill in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).
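A small taste of the C API, assuming ICU4C is installed (link against the icu-uc library); u_strFromUTF8 converts UTF-8 input into ICU's UTF-16 UChar representation:

#include <unicode/ustring.h>
#include <stdio.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar buf[64];
    int32_t length = 0;

    u_strFromUTF8(buf, 64, &length, "h\xc3\xa9llo", -1, &status);  /* "héllo" in UTF-8 */
    if (U_FAILURE(status)) {
        fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
        return 1;
    }
    printf("UTF-16 length: %d code units\n", (int)length);
    return 0;
}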
GLib has some Unicode functions and is a pretty lightweight library. It's not near the same level of functionality that ICU provides, but it might be good enough for some applications. The other features of GLib are good to have for portable C programs too.
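For example, g_utf8_validate and g_utf8_strlen cover two of the most common needs; a minimal sketch assuming GLib 2.x (compile via pkg-config glib-2.0):

#include <glib.h>
#include <stdio.h>

int main(void)
{
    const char *s = "h\xc3\xa9llo";   /* "héllo" encoded as UTF-8 */

    /* Reject byte sequences that are not valid UTF-8 before processing them. */
    if (!g_utf8_validate(s, -1, NULL)) {
        fprintf(stderr, "not valid UTF-8\n");
        return 1;
    }
    /* Length in characters (5 here), not in bytes (6). */
    printf("%ld characters\n", g_utf8_strlen(s, -1));
    return 0;
}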
GTK+ is built on top of GLib. GLib provides the fundamental algorithmic language constructs commonly duplicated in applications. Among its features (this is not a comprehensive list):
Object and type system
Main loop
Dynamic loading of modules (i.e. plug-ins)
Thread support
Timer support
Memory allocator
Threaded Queues (synchronous and asynchronous)
Lists (singly linked, doubly linked, double ended)
Hash tables
Arrays
Trees (N-ary and binary balanced)
String utilities and charset handling
Lexical scanner and XML parser
Base64 (encoding & decoding)
I think one of the interesting questions is - what should your canonical internal format for strings be? The 2 obvious choices (to me at least) are
a) utf8 in vanilla c-strings
b) utf16 in unsigned short arrays
In previous projects I have always chosen UTF-8. Why? Because it's the path of least resistance in the C world. Everything you are interfacing with (stdio, string.h, etc.) will work fine.
Next comes the question of file format. The problem here is that it's visible to your users (unless you provide the only editor for your language). Here I guess you have to take what they give you and try to guess the encoding by peeking at the file (byte order marks help).
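A sketch of the BOM-peeking idea (the order of checks matters: the UTF-32 marks must be tested before the UTF-16 ones, because they share a prefix):

#include <stdio.h>

/* Guess a file's encoding from its byte order mark, if any; files without
   a BOM are assumed to be UTF-8, which is a sane default for source code. */
const char *guess_encoding(FILE *f)
{
    unsigned char bom[4];
    size_t n = fread(bom, 1, 4, f);
    rewind(f);

    if (n >= 4 && bom[0] == 0xFF && bom[1] == 0xFE && bom[2] == 0x00 && bom[3] == 0x00)
        return "UTF-32LE";
    if (n >= 4 && bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF)
        return "UTF-32BE";
    if (n >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return "UTF-8 (with BOM)";
    if (n >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
        return "UTF-16BE";
    return "no BOM (assume UTF-8)";
}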

C program cross platform differences on Windows and Unix OS

Is there any difference in C that is written in Windows and Unix?
I teach C as well as C++, but some of my students have come back saying some of the sample programs do not run for them on Unix. Unix is alien to me; unfortunately I have no experience with it whatsoever. All I know is how to spell it. If there are any differences, then I should be advising our department to invest in Unix systems, as currently there are no Unix systems in our lab. I do not want my students to feel that they have been denied or kept away from something.
That kind of problem usually appears when you don't stick to the bare C standard and make assumptions about the environment that may not be true. These may include reliance on:
nonstandard, platform specific includes (<conio.h>, <windows.h>, <unistd.h>, ...);
undefined behavior (fflush(stdin), as someone else reported, is not required to do anything by the standard - it's actually undefined behavior to invoke fflush on anything but output streams; in general, older compilers were more lenient about violation of some subtle rules such as strict aliasing, so be careful with "clever" pointer tricks);
data type size (the short=16 bit, int=long=32 bit assumption doesn't hold everywhere - 64 bit Linux, for example, has 64 bit long);
in particular, pointer size (void * isn't always 32 bit, and can't be always casted safely to an unsigned long); in general you should be careful with conversions and comparisons that involve pointers, and you should always use the provided types for that kind of tasks instead of "normal" ints (see in particular size_t, ptrdiff_t, uintptr_t)
data type "inner format" (the standard does not say that floats and doubles are in IEEE 754, although I've never seen platforms doing it differently);
nonstandard functions (__beginthread, MS safe strings functions; on the other side, POSIX/GNU extensions)
compiler extensions (__inline, __declspec, #pragmas, ...) and in general anything that begins with double underscore (or even with a single underscore, in old, nonstandard implementations);
console escape codes (this usually is a problem when you try to run Unix code on Windows);
line-ending format: in strings in memory it's \n everywhere, but when written to a file it's \n on *NIX, \r\n on Windows, and \r on pre-OS X Macs; the conversion is handled automagically by the file streams, so be careful to open files in binary mode when you actually want to write binary data, and leave them in text mode when you want to write text (see the sketch right after this list).
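As a sketch of that last point, the same fopen call behaves differently depending on whether the file is opened in text or binary mode (file names here are made up):

#include <stdio.h>

int main(void)
{
    /* Text mode: the runtime translates '\n' to the platform's line ending
       on output ("\r\n" on Windows) and back to '\n' on input. */
    FILE *text = fopen("notes.txt", "w");
    if (text != NULL) {
        fputs("one line\n", text);
        fclose(text);
    }

    /* Binary mode: bytes are written exactly as given; use this for any
       non-text data, or stray 0x0A bytes will get translated on Windows. */
    FILE *bin = fopen("data.bin", "wb");
    if (bin != NULL) {
        unsigned char payload[] = { 0x0A, 0x0D, 0x00, 0xFF };
        fwrite(payload, 1, sizeof payload, bin);
        fclose(bin);
    }
    return 0;
}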
Anyhow, an example of a program that does not compile on *NIX would be helpful; then we could give you more precise suggestions.
I am yet to get the details of the program; the students were from our previous batch and I have asked them for it. Turbo C is what is being used currently.
As said in the comment, please drop Turbo C and (if you use it) Turbo C++; nowadays they are both pieces of history and have many incompatibilities with the current C and C++ standards (and, if I remember correctly, they both generate 16-bit executables, which won't even run on 64-bit OSes on x86_64).
There are a lot of free, working and standard-compliant alternatives (VC++ Express, MinGW, Pelles C, Cygwin on Windows; gcc/g++ is the de facto standard on Linux, rivaled by clang), you just have to pick one.
The language is the same, but the libraries used to get anything platform-specific done are different. But if you are teaching C (and not systems programming) you should easily be able to write portable code. The fact that you are not doing so makes me wonder about the quality of your training materials.
The standard libraries that ship with MSVC and those that ship with a typical Linux or Unix compiler are different enough that you are likely to encounter compatibility issues. There may also be minor dialectic variations between MSVC and GCC.
The simplest way to test your examples in a unix-like environment would be to install Cygwin or MSYS on your existing Windows kit. These are based on GCC and common open-source libraries and will behave much more like the C compiler environment on a unix or linux system.
Cygwin is the most 'Unix-like', and is based on cygwin.dll, an emulation layer that implements Unix system calls on top of the native Win32 API. Generally, anything that compiles on Cygwin is very likely to compile on Linux, as Cygwin is based on gcc and glibc. However, native Win32 APIs are not available to applications compiled on Cygwin.
MSYS/MinGW32 is designed for producing native Win32 apps using GCC. However, most of the standard GNU and other OSS libraries are available, so it behaves more like a unix environment than VC does. In fact, if you are working with code that doesn't use Win32 or unix specific APIs it will probably port between MinGW32 and Linux more easily than it would between MinGW32 and MSVC.
While getting Linux installed in your lab is probably a useful thing to do (Use VMWare player or some other hypervisor if you can't get funding for new servers) you can use either of the above toolchains to get something that will probably be 'close enough' for your purposes. You can learn unix as takes your fancy, and both Cygwin and MSYS will give you a unix-like environment that could give you a bit of a gentle intro in the meantime.
C syntax must be the same if both Windows and Unix compilers adhere to the same C standard. I was told that MS compilers still don't support C99 in full, although Unix compilers are up to speed, so it seems C89 is the lowest common denominator.
However, in the Unix world you will typically use POSIX syscalls to do system-level work such as IPC. Windows isn't a POSIX system, so it has a different API for that.
There is this thing called ANSI C. As long as you code purely in ANSI C, there should be no difference. However, this is a rather academic assumption.
In real life, I have never encountered any of my code being portable from Linux to Windows and vice versa without any modification. Actually, these modifications (definitely plural) turned into a vast amount of preprocessor directives, such as #ifdef WINDOWS ... #endif and #ifdef UNIX ... #endif, and even more if parallel libraries such as OpenMPI were used.
As you may imagine, this is totally contrary to readable and debuggable code, but that was what worked ;-)
Besides, you have to consider things already mentioned: UTF-8 will sometimes knock out Linux compilers...
There should be no difference in the C programming language between Windows and *nix, because the language is specified by the ISO standard.
The C language itself is portable from Windows to Unix. But operating system details are different and sometimes those intrude into your code.
For instance, Unix systems typically use only "\n" to separate lines in a text file, while most Windows tools expect to see "\r\n". There are ways to deal with this sort of difference that let the C runtime handle it for you, but if you aren't careful about them, it's pretty easy to write OS-specific C code.
My suggestion is that you run a Unix in a virtual machine and use that to test your code before you share it with your students.
I think it's critical that you familiarize yourself with Unix right now.
An excellent way to do this is with a Knoppix CD.
Try to compile your programs under Linux using gcc, and when they don't work, track down the problems (#include <windows.h>?) and make them work. Then return to Windows, and they'll likely compile fine.
In this way, you will discover that your programs become cleaner and better teaching material, even for lab exercises on Windows machines.
A common problem is that fflush(stdin) doesn't work on Unix.
Which is perfectly normal, since the standard doesn't define how the implementation should handle it.
The solution is to use something like this (untested):
int c;
do
{
    c = getchar();    /* read and discard characters up to the newline or EOF */
}
while (c != '\n' && c != EOF);
Similarly, you need to avoid anything that causes undefined behavior.

Resources