Learning C: strange code, what does it do? - c

I'm exploring wxWidgets and at the same time learning C/C++. Often wxWidgets functions expect a wxString rather than a string, therefore wxWidgets provides a macro wxT(yourString) for creating wxStrings. My question concerns the expansion of this macro. If you type wxT("banana") the expanded macro reads L"banana". What meaning does this have in C? Is L a function here that is called with argument "banana"?

"banana" is the word written using 1-byte ASCII characters.
L"banana" is the word written using multi-byte (generally 2-byte Unicode) characters.

L is a prefix on string literals that marks them as wide (Unicode) strings.

The L tells your compiler that it's a wide string literal (an array of wchar_t) instead of a "normal" char one.

Related

Translating String Interpolation to C

I'm well on my way on a programming language I've written in Java that compiles directly into C99 code. I want to add string interpolation functionality and am not sure what the resulting C code would be. In Ruby, you can interpolate strings: puts "Hello #{name}!" What would be the equivalent in C?
So-called interpolated strings are really just expressions in disguise, consisting of string concatenation of the various parts of the string content, alternating string literal fragments with interpolated subexpressions converted to string values.
The interpolated string
"Hello #{name}!"
is equivalent to
concatenate(concatenate("Hello",toString(name)),"!")
The generalization to more complicated interpolated strings should be obvious.
You can compile it to the equivalent of this in C. You will need a big library of type-specific toString operations to match the types in your language. User defined types will be fun.
You may be able to implement special cases of this using "sprintf", which is the string-building version of C's "printf" library function, in cases where the types of the interpolated expressions match the limited set of types that printf format strings can handle (e.g., native ints and floats).
The printf family would be good to read, as is the scanf family of functions in the C library.
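As a concrete sketch of the sprintf-based special case mentioned above, here is one shape the generated C could take for `"Hello #{name}!"` when the interpolated expression is already a string. The function name `interpolate_name` is hypothetical, and the two-pass snprintf pattern (size first, then fill) is just one common way to build the buffer:

```c
#include <stdio.h>
#include <stdlib.h>

/* One possible compilation of the Ruby-style "Hello #{name}!".
   Pass 1 asks snprintf for the required length; pass 2 fills the buffer. */
char *interpolate_name(const char *name)
{
    int n = snprintf(NULL, 0, "Hello %s!", name);        /* pass 1: length only */
    char *out = malloc((size_t)n + 1);
    if (out != NULL)
        snprintf(out, (size_t)n + 1, "Hello %s!", name); /* pass 2: fill */
    return out;                                          /* caller frees */
}
```

For non-string interpolated types, the compiler would substitute the matching format specifier (or call a type-specific toString first).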

Are trigraphs required to write a newline character in C99 using only ISO 646?

Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly. For instance, one could opt to write a Hello World program as such:
%:include <stdio.h>
%:include <stdlib.h>
int main()
<%
    fputs("Hello World!??/n", stdout);
    return EXIT_SUCCESS;
%>
However, besides digraphs, I used the ??/ trigraph to write the \ character.
Given my assumptions above, is it possible to either
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
write a newline to a FILE * without using the '\n' character?
For stdout you could just use puts("") to output a newline. Or indeed replace the fputs in your original program with puts and delete the \n.
If you want to get the newline character into a variable so you can do other things with it, I know another standard function that gives you one for free:
#include <time.h>
#include <string.h>

int gimme_a_newline(void)
{
    time_t t = time(0);
    return strchr(ctime(&t), 0)[-1];  /* ctime()'s result always ends "\n\0" */
}
You could then say
fprintf(stderr, "Hello, world!%c", gimme_a_newline());
(I hope all of the characters I used are ISO646 or digraph-accessible. I found it surprisingly difficult to get a simple list of which ASCII characters are not in ISO646. Wikipedia has a color-coded table with not nearly enough contrast between colors for me to tell what's what.)
Your premise:
Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly.
is questionable. C99 defines "source" and "execution" character sets, and requires that both include representations of the backslash character (C99 5.2.1). The only reason I can imagine for an effort such as you describe would be to try to produce source code that does not require character set transcoding upon movement among machines. In that case, however, the choice of ISO 646 as a common baseline is odd. You're more likely to run into an EBCDIC machine than one that uses an ISO 646 variant that is not coincident with the ISO-8859 family of character sets. (And if you can assume ISO 8859, then backslash does not present a problem.)
Nevertheless, if you insist on writing C source code without using a literal backslash character, then the trigraph for that character is the way to do so. That's what trigraphs were invented for. In character constants and string literals, you cannot portably substitute anything else for \n or its trigraph equivalent, ??/n, because it is implementation-dependent how that code is mapped. In particular, it is not safe to assume that it maps to a line-feed character (which, however, is included among the invariant characters of ISO 646).
Update:
You ask specifically whether it is possible to
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
No, it is not possible, because there is no one '\n' character. Moreover, there seems to be a bit of a misconception here: \n in a character or string literal represents one character in the execution character set. The compiler is therefore responsible for that transformation, not the stdio functions. The stdio functions' responsibility is to handle that character on output by writing a character or character sequence intended to produce the specified effect ("[m]oves the active position to the initial position of the next line").
You also ask whether it is possible to
write a newline to a FILE * without using the '\n' character?
This one depends on exactly what you mean. If you want to write a character whose code in the execution character set you know, then you can write a numeric constant having that numeric value. In particular, if you want to write the character with encoded value 0xa (in the execution character set) then you can do so. For example, you could
fputc(0xa, my_file);
but that does not necessarily produce a result equivalent to
fputc('\n', my_file);
Short answer is, yes, for what you want to do, you have to use this trigraph.
Even if there were a digraph for \, it would be useless inside a string literal, because digraphs must be tokens: they are recognized by the tokenizer, while trigraphs are preprocessed and so still work inside string literals and the like.
Still wondering why somebody would encode source this way today ... :o
No. \n (or its trigraph equivalent) is the portable representation of a newline character.
No. You'd have to represent the literal newline somehow, and \n (or its trigraph equivalent) is the only portable representation.
It's very unusual to find C source code that uses trigraphs or digraphs! Some compilers (e.g. GNU gcc) require command-line options to enable trigraphs, and assume they have been used unintentionally, issuing a warning if any are encountered in the source code.
EDIT: I forgot about puts(""). That's a sneaky way to do it, but only works for stdout.
Yes, of course it's possible:
fputc(0x0A, file);

C11 Unicode Support

I am writing some string conversion functions similar to atoi() or strtoll(). I wanted to include a version of my function that would accept a char16_t* or char32_t* instead of just a char* or wchar_t*.
My function works fine, but as I was writing it I realized that I do not understand what char16_t or char32_t are. I know that the standard only requires that they are an integer type of at least 16 or 32 bits respectively but the implication is that they are UTF-16 or UTF-32.
I also know that the standard defines a couple of functions but they did not include any *get or *put functions (like they did when they added in wchar.h in C99).
So I am wondering: what do they expect me to do with char16_t and char32_t?
That's a good question with no apparent answer.
The uchar.h types and functions added in C11 are largely useless. They only support conversions between the new type (char16_t or char32_t) and the locale-specific, implementation-defined multibyte encoding, mappings which are not going to be complete unless the locale is UTF-8 based. The useful conversions (to/from wchar_t, and to/from UTF-8) are not supported. Of course you can roll your own for conversions to/from UTF-8 since these conversions are 100% specified by the relevant RFCs/UCS/Unicode standards, but be careful: most people implement them wrong and have dangerous bugs.
Note that the new compiler-level features for UTF-8, UTF-16, and UTF-32 literals (u8, u, and U, respectively) are potentially useful; you can process the resulting strings with your own functions in meaningful ways that don't depend at all on locale. But the library-level support for Unicode in C11 is, in my opinion, basically useless.
Testing if a UTF-16 or UTF-32 character in the ASCII range is one of the "usual" 10 digits, +, -, or "normal" white-space is easy, as is converting '0'-'9' to a digit value. Given that, atoi_utf16/32() proceeds like atoi(): simply inspect one character at a time.
Testing whether some other UTF-16/UTF-32 character is a digit or white-space is harder. Code would need extended isspace() and isdigit() functions, which can be had by switching locales (setlocale()) if the needed locale is available. (Note: you likely need to restore the locale when the function is done.)
Converting a character that passes isdigit() but is not one of the usual 10 to its value is problematic. In any case, that appears not even to be allowed.
Conversion steps:
Set the locale to a corresponding one for UTF-16/UTF-32.
Use isspace() for white-space detection.
Convert in a similar fashion for your_atof().
Restore the locale.
This question may be a bit old, but I'd like to touch on implementing your functions with char16_t and char32_t support.
The easiest way to do this is to write your strtoull function using the char32_t type (call it something like strtoull_c32). This makes parsing Unicode easier because every character in UTF-32 occupies four bytes. Then implement strtoull_c16 and strtoull_c8 by internally converting both UTF-8 and UTF-16 encodings to UTF-32 and passing them to strtoull_c32.
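A minimal sketch of the char32_t core described above, restricted to the easy ASCII-range case (the function name and white-space handling are illustrative, not a library API):

```c
#include <uchar.h>
#include <stdint.h>

/* Minimal sketch: parse an unsigned decimal number from a char32_t string.
   Handles only ASCII-range digits U+0030..U+0039 and simple white-space,
   as in the one-character-at-a-time approach discussed above. */
uint64_t strtoull_c32(const char32_t *s, const char32_t **end)
{
    uint64_t v = 0;
    while (*s == U' ' || *s == U'\t')     /* skip simple white-space */
        s++;
    while (*s >= U'0' && *s <= U'9')      /* ASCII digits only */
        v = v * 10 + (uint64_t)(*s++ - U'0');
    if (end)
        *end = s;                         /* first unconsumed character */
    return v;
}
```

strtoull_c16 and strtoull_c8 would then decode their input to char32_t code points and delegate to this.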
I honestly haven't looked at the Unicode facilities in the C11 standard library, but if they don't provide a suitable way for converting those types to UTF-32 then you can use a third party library to make the conversion for you.
There's ICU, which was started by IBM and then adopted by the Unicode Consortium. It's a very feature-rich and stable library that's been around for a long time.
I started a UTF library (UTFX) for C89 recently, that you could use for this too. It's pretty simple and lightweight, unit tested and documented. You could give that a go, or use it to learn more about how UTF conversions work.

Is there a built-in way to parse standard escape sequences in a user-supplied string in C?

In C, if I put a literal string like "Hello World \n\t\x90\x53" into my code, the compiler will parse the escape sequences into the correct bytes and leave the rest of characters alone.
If the above string is instead supplied by the user, either on the command line or in a file, is there a way to invoke the compiler's functionality to get the same literal bytes into a char[]?
Obviously I could manually implement the functionality by hardcoding the escape sequences, but I would prefer not to do that if I can just invoke some compiler library instead.
No, there's no standard function to do that.
A suggestion for a non-standard library solution is to use glib's g_strcompress() function.
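If neither option fits, the hand-rolled route the question mentions might look like the sketch below. The function `unescape` is hypothetical; it handles only a few common sequences and a fixed two-digit \xHH form (the real C grammar allows a variable number of hex digits):

```c
#include <stddef.h>

/* Illustrative helper: decode \n, \t, \\, \" and two-digit \xHH in `in`,
   writing the result to `out` (sized at least strlen(in)+1).
   Unrecognized sequences are copied through unchanged. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

void unescape(const char *in, char *out)
{
    while (*in) {
        if (*in == '\\' && in[1]) {
            switch (in[1]) {
            case 'n':  *out++ = '\n'; in += 2; continue;
            case 't':  *out++ = '\t'; in += 2; continue;
            case '\\': *out++ = '\\'; in += 2; continue;
            case '"':  *out++ = '"';  in += 2; continue;
            case 'x': {
                int hi = hexval(in[2]);
                int lo = hi >= 0 ? hexval(in[3]) : -1;
                if (lo >= 0) { *out++ = (char)(hi * 16 + lo); in += 4; continue; }
                break;  /* malformed \x: fall through and copy verbatim */
            }
            }
        }
        *out++ = *in++;
    }
    *out = '\0';
}
```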

what does the `TEXT` around the format string mean in "printf"

The following prints the percentage of memory used.
printf (TEXT("There is %*ld percent of memory in use.\n"),
WIDTH, statex.dwMemoryLoad);
WIDTH is defined to be equal to 7.
What does TEXT mean, and where is this sort of syntax defined in printf?
As others already said, TEXT is probably a macro.
To see what they become, simply look at the preprocessor output. If you are using gcc:
gcc -E file.c
Just guessing, but TEXT may be a char*-to-char* function that takes care of translating a text string for internationalization support.
Note that if this is the case, then maybe you are also required to always use TEXT with a string literal (and not with expressions or variables), to allow an external tool to detect all literals that need translation by a simple scan of the source code. For example, maybe you should never write:
puts(TEXT(flag ? "Yes" : "No"));
and you should write instead
puts(flag ? TEXT("Yes") : TEXT("No"));
Something that is instead standard, but not used very often, is the parametric width of a field: for example, in printf("%*i", x, y) the first parameter x is the width used to print the second parameter y as a decimal value.
When used with scanf, the * special char instead specifies that you don't want to store the field (i.e. to "skip" it instead of reading it).
TEXT() is probably a macro or function which returns a string value. I think it is user defined and does some manner of formatting on that string which is passed as an argument to the TEXT function. You should go to the function declaration for TEXT() to see what exactly it does.
TEXT() is a Unicode-support macro defined in winnt.h. If UNICODE is defined, then it prepends L to the string literal, making it wide.
Also see TEXT vs. _TEXT vs. _T, and UNICODE vs. _UNICODE blog post.
_TEXT() or _T() is a Microsoft-specific macro.
This MSDN link says
To simplify code development for various international markets,
the Microsoft run-time library provides Microsoft-specific "generic-text" mappings for many data types, routines, and other objects.
These mappings are defined in TCHAR.H.
You can use these name mappings to write generic code that can be compiled for any of the three kinds of character sets:
ASCII (SBCS), MBCS, or Unicode, depending on a manifest constant you define using a #define statement.
Generic-text mappings are Microsoft extensions that are not ANSI compatible.
_TEXT is a macro that makes strings "character set neutral".
For example _T("HELLO");
Characters can be denoted either in the 8-bit ANSI encodings or in the 16-bit Unicode notation.
If you wrap all strings in _TEXT and define the preprocessor symbol "_UNICODE", all such strings will use Unicode encoding. If you don't define _UNICODE, the strings will all be ANSI.
Hence the macro _TEXT allows you to have all strings as Unicode or ANSI.
So there is no need to change the code every time you change your character set.
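The mechanism can be sketched with a homemade version of the macro (named MY_TEXT here to make clear it is an illustration; the real TEXT/_T definitions live in winnt.h and tchar.h and go through an extra level of indirection):

```c
#include <wchar.h>

/* Homemade sketch of the TEXT()/_T() idea: with UNICODE defined, the macro
   pastes an L prefix onto the literal to make it wide; otherwise the
   literal stays narrow. TCHAR changes type to match. */
#ifdef UNICODE
typedef wchar_t TCHAR;
#define MY_TEXT(s) L##s
#else
typedef char TCHAR;
#define MY_TEXT(s) s
#endif

static const TCHAR greeting[] = MY_TEXT("HELLO");
```

Compiled without UNICODE, greeting is a plain 6-byte char array; with -DUNICODE it becomes an array of wchar_t, with no change to the source.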
