wmain vs main C runtime - c

I have read a few articles about the different Windows C entry points, wmain and WinMain.
So, if I am correct, these were added to C compilers for Windows. But how are they implemented?
For example, wmain gets Unicode in argv[], but it is the OS that sends these arguments to the program, so is there a special field in the .exe file that tells Windows to pass the arguments as Unicode? Thanks.

Modern versions of Windows internally use UTF-16. Therefore, when you launch an executable, all command line arguments likely are passed as UTF-16 from the onset, and the runtime library linked into the launched application either passes the arguments through unscathed (if using wmain) or converts them to the local encoding automatically (if using main). (Specifically this would be done by wmainCRTStartup/mainCRTStartup which are the actual entry points used for console Windows applications.)
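To picture the conversion step described above, here is a minimal portable sketch that turns a UTF-16 string into a byte-oriented one. It is not the CRT's code: the real startup routine targets the process's ANSI code page and its internals are not shown in this thread. UTF-8 is assumed as the target only to keep the example self-contained, and `utf16_to_utf8` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a NUL-terminated UTF-16 string as UTF-8.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 if the input is malformed or the buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return (size_t)-1;

    while (*in != 0) {
        uint32_t cp = *in++;

        if (cp >= 0xD800 && cp <= 0xDBFF) {         /* high surrogate */
            uint32_t lo = *in;
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;                  /* unpaired surrogate */
            in++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                      /* stray low surrogate */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need >= out_size)
            return (size_t)-1;                      /* keep room for '\0' */

        if (need == 1) {
            out[n++] = (char)cp;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

A startup routine for main would run something like this over each argument before calling main; one for wmain would skip the step and hand the 16-bit strings over unchanged.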

First: a pedantic rant: wmain certainly does not get Unicode arguments. Unicode is defined independently of any particular encoding. wmain gets arguments in a 16 bit character encoding of Unicode, UTF-16 at a guess. I've just checked Microsoft's documentation on wmain and the links from it and it is clear that Microsoft had no clue about what Unicode is when they wrote it.
Anyway, the entry point of a program is defined by the linker. A C program always has a certain amount of prologue code that runs before main/wmain/WinMain/wWinMain. If one of the wide versions of main is used, the prologue code converts the characters in the environment from whatever character set they are in to the wide character version.

Related

How does the encoding used to save a C program affect compilation?

I created a basic, simple C program and saved it with the .c extension using the Unicode encoding standard, and it did not compile correctly. An error occurred saying null character(s) ignored, but when I saved the same program using the ASCII standard it compiled just fine.
What is the reason behind this? My compiler is GCC.
Thank you.
There is no encoding called "Unicode". Unicode is not an encoding. It is a standard for many many things including several encodings.
Examples of encodings are UTF-16LE and UTF-8. I presume you're using Notepad.exe on Windows; Microsoft calls UTF-16 little-endian "Unicode" there. It represents each ASCII character as two bytes, one of which is a null byte.
As far as I know GCC never expects the file to be in UTF-16 encoding, so it just ignores these intervening null bytes...
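A quick way to confirm this diagnosis is to look at the first bytes of the file. The sketch below checks for a byte-order mark and counts null bytes; `detect_bom` and `count_nul_bytes` are hypothetical helper names, and the check is a heuristic only (a file may have no BOM, and the bytes FF FE 00 00 also begin a UTF-32LE file).

```c
#include <stddef.h>

typedef enum { ENC_UNKNOWN, ENC_UTF8_BOM, ENC_UTF16_LE, ENC_UTF16_BE } bom_kind;

/* Classify a buffer by its leading byte-order mark, if it has one. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8_BOM;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}

/* Count null bytes; ordinary ASCII or UTF-8 source text contains none. */
size_t count_nul_bytes(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00)
            count++;
    }
    return count;
}
```

For the text `int` saved as UTF-16LE with a BOM, the bytes are FF FE 69 00 6E 00 74 00: a UTF-16LE mark followed by one null byte per ASCII character, which matches the "null character(s) ignored" message.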
What you need to do is get a proper text editor that uses proper terminology and save your files as UTF-8 or whatever lesser encoding the operating system happens to use from day to day.

How to deal with Unicode paths in a cross-platform C library?

I'm contributing to a C library. It has a function that takes a char* parameter for a file path name. The authors are mostly UNIX developers, and this works fine on unixes where char* mostly means UTF-8. (At least in GCC, the character set is configurable and UTF-8 is the default.)
However, char* means ANSI on Windows, which implies that it is currently impossible to use Unicode path names with this library on Windows, where wchar_t* should be used and only UTF-16 is supported. (A quick search on StackOverflow reveals that the ANSI Windows API functions can not be used with UTF-8.)
The question is, what is the right way to deal with this? We've come up with various ways to do it, but neither of us are Windows experts, so we can't really decide how to do it properly. Our goal is that the users of the library should be able to write cross-platform code that would work on unixes as well as windows.
Under the hood, the library has #ifdefs in place to differentiate between operating systems so that it can use POSIX functions on UNIXes and Win32 APIs on Windows.
So far, we've come up with the following possibilities:
1. Offer a separate Windows-only function that accepts a wchar_t*.
2. Require UTF-16 on Windows and #ifdef the library header in such a way that the function would accept wchar_t* on Windows.
3. Add a flag that would tell the function to cast the given char* to wchar_t* and call the widechar Windows APIs.
4. Create a variant of the function that takes a file descriptor (or file handle on Windows) instead of a file path.
5. Always require UTF-8 (even on Windows), and then inside the function, convert UTF-8 to UTF-16 and call the widechar Windows APIs.
The problem with options 1-4 is that they would require the user to consciously take care of portability themselves. Option 5 sounds good, but I'm not sure if this is the right way to go.
I'm also open to other suggestions or ideas that can solve this. :)
Since portability is an important goal for you, I think it is imperative for your function semantics to be precisely defined. Among other things, that means that the arguments' types and meanings don't vary across platforms. So, if you have a function that accepts regular char based paths then it should accept such paths on all systems, and the encoding expected of those paths should be well-defined (which does not necessarily mean "the same"). That rules out options (2) and (3).
Moreover, portability requires the same functions to be usable across all platforms; that rules out (1). Option (4) could be ok if a stream- and/or file descriptor-based approach were the only one provided by your library, but it yields portability only with respect to those functions, not with respect to the path-based ones. (And note that stream (FILE *) APIs are defined by C, whereas file descriptors are a POSIX concept, not native to C. In principle, therefore, streams are more portable than file descriptors.)
(5) could work, but it places stronger constraints than you actually need. It is not essential for the function to define the encoding expected (though it can); it suffices for it to define how that encoding is determined.
Additionally, you could add wchar_t-based functions that work everywhere (as opposed to Windows-only). Those might be more convenient for Windows users. Similar to alternative (4), however, that provides portability only with respect to those functions. Supposing that you don't want to drop the char-based ones, you would need to pair this alternative with some variation on (5).
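For reference, option (5) from the question could look roughly like the sketch below. The names `lib_fopen_utf8` and `utf8_to_wide` are hypothetical and not part of any library discussed here; the Windows branch relies on the Win32 function MultiByteToWideChar and the CRT function _wfopen, and the other branch assumes the platform treats char* paths as opaque bytes (typically UTF-8).

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to a newly allocated UTF-16 string.
 * Returns NULL on invalid input or allocation failure; the caller frees it. */
static wchar_t *utf8_to_wide(const char *s)
{
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0)
        return NULL;

    wchar_t *w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical library entry point: the path is always UTF-8, on every OS. */
FILE *lib_fopen_utf8(const char *utf8_path, const char *mode)
{
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(utf8_path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *f = (wpath != NULL && wmode != NULL) ? _wfopen(wpath, wmode) : NULL;
    free(wpath);
    free(wmode);
    return f;
#else
    /* POSIX systems pass the bytes through to the kernel unchanged. */
    return fopen(utf8_path, mode);
#endif
}
```

Callers write the same code on every platform; only the body of the wrapper differs. The cost is that Windows users holding a wchar_t* path must convert it to UTF-8 first.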

how to handle russian string as a command line argument in C program

I have an exe file build from C code. There is a situation where russian string is passed as an argument to this exe.
When I call exe with this argument, task manager shows russian string perfectly as command line argument.
But when I print that argument from my exe it just prints ???
How can I make my C program(hence exe) handle russian character?
The answer depends on a target platform for your program. Traditionally, a C- or C++-program begins its life from main(....) function which may have byte-oriented strings passed as arguments (notice char* in main declaration int main(int argc, char* argv[])). Byte-oriented strings mean that characters in a string are passed in a specific byte-oriented encoding and one character, for example Я or Ñ in UTF-8 may take more than 1 char.
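The difference between bytes and characters can be seen with a small sketch. It assumes the input is valid UTF-8, and `utf8_count_code_points` is a hypothetical helper name: it counts every byte that is not a continuation byte (bit pattern 10xxxxxx).

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string, assuming valid UTF-8.
 * Every code point contributes exactly one non-continuation byte. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string Я encoded as the two bytes D0 AF, strlen reports 2 while this function reports 1.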
Nowadays the most widely used encoding on Linux/Unix platforms is UTF-8, but some time ago other encodings were in use, such as ISO8859-1, KOI8-R and many others. Most programs are still byte-oriented, as the UTF-8 encoding is mostly backward-compatible with the traditional C string APIs.
On the other hand, wide strings can be more convenient to use, because each character in a wide string occupies a predefined amount of space. Thus, for example, the following expression passes the assertion test: std::wstring hello = L"Привет!¡Hola!"; assert(L'в' == hello[3]); (if UTF-8 char strings were used, the test would fail). So if your program performs a lot of operations on letters, not on strings as a whole, then wide strings can be the solution.
To convert strings from a multi-byte to a wide character encoding, you may use the mbtowc family of functions or that awesome fancy codecvt C++11 facility if your compiler supports it (likely it doesn't as of mid-2014 :))
On Windows, strings can also be passed as byte-oriented strings, and for Russian CP1251 is most likely used (it depends on the operating system settings, but for Windows sold within Russia and the CIS this is the most popular variant). MSVC also has a language extension that lets an application programmer avoid all this complexity of manually converting byte strings to wide strings: a variant of the main() function that receives wide strings directly.
@user3159253 provided a good answer that I will complete with some more references:
Windows: Usually it uses wide characters.
Linux: Normally it uses UTF-8 encoding: please do NOT use wide chars in this case.
You are facing an internationalization issue (cf. i18n, l10n).
You might need tools like iconv for character set conversion, and gettext for string translation.

Argument treatment at command line in C

I was asked whether integer or float arguments supplied at the command prompt are still treated as strings in C. I am not sure about that; can anyone help a little? Is this true in C, and why? And what about other languages like Java or Python?
It is true, independent of the language, that the command line arguments to programs on Unix are strings. If the program chooses to interpret them as numbers, that is fine, but it is because the program (or programmer) chose to do so. Similarly, the runtime support for a language might alter the arguments passed by the o/s into integer or float types, but the o/s passes strings to that runtime (I know of no language that does this, but I don't claim to know all languages).
To see this, note that the ways to execute a program are the exec*() family of functions, and each of those takes a string which is the name of the program to be executed, and an array of strings which are the arguments to be passed to the program. Functions such as system() and popen() are built atop the exec*() family of functions — they also use fork(), of course. Even the posix_spawn() function takes an array of pointers to strings for the arguments of the spawned program.
It's not unlike mailing a letter without an envelope. We all agree to use the common enclosure known as an envelope. Operating systems pass parameters to programs using the common item known as a string of characters. It's beyond the scope of the operating system to understand what the program wants to do with the parameters.
There are some exceptions, one which comes to mind is the passing of parameters to a Linux Kernel Module. These can be passed as items other than strings.
Basically, this is an issue of creating an interface between the operating system and the program. Any program. Remember that programs are not always written in C, and you don't even know whether the language has things like float or int.
You want to be able to pass several arguments (with natural delimiters), which may easily encode arbitrary information. In C, strings can be of arbitrary length, and the only constraint on them is that a zero byte in them signifies the end of the string. This is a highly flexible and natural way to pass arbitrary information to the program.
So you can never supply arbitrary integer/float arguments directly to a program; the operating system (Unix / Linux / Windows / etc.) won't let you. You don't have any tool that gives you that interface, in the same way that you can't pass a mouse-click as an argument. All you supply is a sequence of characters.
Since Unix and C were designed together, it is also part of the C programming language, and from there it worked its way to C++, Java, Python and most other modern programming languages, and the same way into Linux, Windows and most other operating systems.
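The point made in these answers, that an argument arrives as a string and becomes a number only when the program chooses to interpret it, can be sketched with the standard strtol function. `parse_long_arg` is a hypothetical helper name.

```c
#include <errno.h>
#include <stdlib.h>

/* Interpret one command-line argument as a base-10 long.
 * Returns 1 and stores the value on success; returns 0 if the text is
 * empty, has trailing junk, or does not fit in a long. */
int parse_long_arg(const char *text, long *out)
{
    char *end;

    errno = 0;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE)
        return 0;

    *out = value;
    return 1;
}
```

A program would call this on argv[1] after checking argc; until it does, "42" is just the three bytes '4', '2' and a terminating zero.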

A Windows C compiler that doesn't split arguments in its runtime libraries?

I have heard that in Windows, parameters are passed as a single parameter, and then the program splits it into arguments, either in its runtime libraries or, sometimes, in the actual code.
I've heard that most C/C++ compilers do it in their runtime libraries (for example, TCC - Tiny C Compiler, which I downloaded).
Are there any C compilers I can download that don't? Any links to them?
And in such a compiler, would argv[0] have the whole string?
Added
It's based on what this person (jdedb) said in Super User question Can't pipe or redirect Cygwin grep output, after seeming to suggest that I ask on Stack Overflow.
"It's up to the called program to split the command tail into words, if it wants to operate in Unix (and C language) fashion. (The runtime support libraries of most C and C++ language implementations for Win32 do this splitting behind the scenes."
He said it's the compilers. But according to Necrolis, it's not the compiler.
(Added: Necrolis commented to correct my misreading; compiler != runtime library.)
If you are on Windows, just use GetCommandLine. This is how most CRT wrappers get the command line to split to start with.
As for your actual question, it's not the compiler, but the CRT startup wrapper that it uses. If you implement mainCRTStartup, and override the entry point with it, you can do whatever you want. A good example of how it works can be seen here.
That "parameter splitting" is the way mandated by the C99 Standard (PDF file) in 5.1.2.2.1.
If an implementation (compiler + library + options) recognizes but does not separate the program name from the other parameters (and parameters from each other) it is not conforming.
Of course, if you use a free-standing implementation none of this applies.
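To make the splitting step concrete, here is a deliberately simplified sketch of what a startup wrapper might do with the single command-line string. It is an illustration only: `split_command_line` is a hypothetical name, the rules (whitespace separates arguments, double quotes group them) are assumptions, and it does not reproduce the backslash and quote handling of any real CRT.

```c
#include <stddef.h>

/* Split cmd in place into at most max_args NUL-terminated arguments.
 * Spaces and tabs separate arguments unless inside double quotes; the
 * quotes themselves are dropped. Returns the argument count, or -1 if
 * there are more arguments than max_args. */
int split_command_line(char *cmd, char **argv, int max_args)
{
    int argc = 0;
    char *r = cmd;                      /* read position */

    while (*r != '\0') {
        while (*r == ' ' || *r == '\t')
            r++;                        /* skip separators */
        if (*r == '\0')
            break;
        if (argc == max_args)
            return -1;

        char *w = r;                    /* write position, never ahead of r */
        int in_quotes = 0;
        argv[argc++] = w;

        while (*r != '\0' && (in_quotes || (*r != ' ' && *r != '\t'))) {
            if (*r == '"')
                in_quotes = !in_quotes; /* toggle and drop the quote */
            else
                *w++ = *r;
            r++;
        }
        if (*r != '\0')
            r++;                        /* step past the separator first */
        *w = '\0';                      /* then terminate this argument */
    }
    return argc;
}
```

On Windows the input string would come from GetCommandLine, as mentioned above; a program that wants the unsplit text can use that string directly and skip a step like this one.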

Resources