This program uses the scanf function with %s as the format specifier. The program is fixed and I cannot change anything in it. I have to type characters so that specific ASCII codes end up in memory. I already found out that I can write NUL (0x00) into memory with 'Ctrl'+'Shift'+'#'.
How can I enter all the other special ASCII codes?
I don't know if this is important, but I use Linux and have an English keyboard.
wprintf() takes a wchar_t string as argument and prints the string in the specified locale character encoding.
But I have noticed that when using printf() and passing it a UTF-8 string, the UTF-8 string will always be printed regardless of the specified locale character encoding (for example, if the UTF-8 string contains Arabic characters, and the locale is set to "C" (not "C.UTF-8"), then the Arabic characters will still be printed).
Am I correct that printf() doesn't care about the locale?
True, printf doesn't care about the locale for C strings. If you pass it a UTF-8 string, it knows nothing about the encoding; it just sees a sequence of bytes (hopefully terminated by an ASCII NUL). The bytes are passed to the output as-is and are interpreted by the terminal (or whatever the output is). If the terminal can interpret UTF-8 sequences, it does so (if not, it interprets them the way it is configured, Latin-1 or the like), and if it can also display them correctly, it does (sometimes it lacks the right font or glyph and prints unknown characters as ? or similar).
This is one of the big virtues (perhaps the biggest virtue) of UTF-8: it's just a string of reasonably ordinary bytes. If your code-editing environment knows how to let you type
printf("Cööl!\n");
and if your display environment (e.g. your terminal window) knows how to display it, you can just write that, and run it, and it works (as it sounds like you've discovered).
So you don't need special run-time support, you don't need special header files or libraries or anything, you don't need to write your code in some fancy new Unicodey way -- you can just keep on using ordinary C strings and printf and friends like you're used to, and it all just works.
Of course, those two if's can be big ones. If you can't figure out how to (or your code editing environment won't let you) type the characters, or if your display environment doesn't display them, you may be stuck, or you may have to do some hard work after all. (Display environments that don't properly display UTF-8 output from C programs are evidently quite common, based on the number of times the question gets asked here on SO.)
See also the "UTF-8 Everywhere" manifesto.
(Now, with all of this said, this doesn't mean that printf doesn't care about locale settings at all. There are aspects of the locale that printf may care about, and there may be character sets and encodings that printf might have to treat specially, in a locale-dependent way. But since printf doesn't have to do anything special to make UTF-8 work right, that one aspect of the locale -- although it's a biggie -- doesn't end up affecting printf at all.)
Let's consider the following simple program, which uses printf() to print a wide string if run without command-line arguments, and wprintf() otherwise:
#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
const wchar_t hello1[] = L"تحية طيبة";
const wchar_t hello2[] = L"Tervehdys";
int main(int argc, char *argv[])
{
    if (!setlocale(LC_ALL, ""))
        fprintf(stderr, "Warning: Current locale is not supported by the C library.\n");
    if (argc <= 1) {
        printf("printf 1: %ls\n", hello1);
        printf("printf 2: %ls\n", hello2);
    } else {
        wprintf(L"wprintf: %ls\n", hello1);
        wprintf(L"wprintf: %ls\n", hello2);
    }
    return EXIT_SUCCESS;
}
Using the GNU C library and any UTF-8 locale:
$ ./example
printf 1: تحية طيبة
printf 2: Tervehdys
$ ./example wide
wprintf: تحية طيبة
wprintf: Tervehdys
i.e. both produce the exact same output. However, if we run the example in the C/POSIX locale (that only supports ASCII), we get
$ LANG=C LC_ALL=C ./example
printf 1: printf 2: Tervehdys
i.e., the first printf() stopped at the first non-ASCII character (and that's why the second printf() printed on the same line);
$ LANG=C LC_ALL=C ./example wide
wprintf: ???? ????
wprintf: Tervehdys
i.e. wprintf() replaces wide characters that cannot be represented in the charset used by the current locale with a ?.
So, if we consider the GNU C library (which exhibits this behaviour), then we must say yes, printf cares about the locale, although it actually mostly cares about the character set used by the locale, and not the locale per se:
printf() will stop when trying to print wide strings that cannot be represented by the current character set (as defined by the locale). wprintf() will output question marks for those characters instead.
libc6-2.23-0ubuntu10 on x86-64 (amd64) does some replacements for multibyte characters in the printf format string, but multibyte characters in strings printed with %s are printed as-is. This makes it hard to say exactly when printf() gives up at the first multibyte or wide character it cannot convert, and when it simply prints the bytes as-is.
However, wprintf() is pretty rock solid. (It too may choke if you try to print narrow strings with multibyte characters not representable in the character set used by the current locale, but for wide string stuff, it seems to work very well.)
Do note that POSIX.1 C libraries also provide iconv_open(), iconv(), and iconv_close() for converting strings, as well as mbstowcs() and wcstombs() to convert between wide and narrow/multibyte strings. You can also use asprintf() to create a dynamically allocated narrow string out of narrow and/or wide character strings (%s and %ls, respectively).
I use a C program to count the words in a text. But when the text is the output of man malloc >x, the hyphen is also printed.
I finally found that the hyphen is a multibyte character.
Who can tell me the hyphen's ASCII code?
It's around line 17 of the output of man malloc >x.
First of all, if you don't give the character, we can't know what it is. A man page is always reformatted and modified to fit the display context.
i find the hyphen is a multi-character character. who can tell me the hyphen's ascii.
If it's a multibyte character, then it's not ASCII; it's Unicode. My guess is that it's:
‐
which is Unicode character 8208 (U+2010, HYPHEN). Hint: in python3, run:
>>> print(ord('‐'))
8208
now to handle that, you need to include wchar.h, use a wchar_t* string and count the characters using wcslen(). As you like to read manuals:
man wcslen
as a snippet:
/* Wide literals need the L prefix; wcslen() returns size_t, hence %zu. */
const wchar_t *s = L"This is a hyphen: `‐` !";
printf("%zu\n", wcslen(s));
N.B.: to avoid hyphenation of words when the man page is displayed, you may want to set your COLUMNS environment variable to a very large value ;-)
N.B.2: you may also want to use nroff -mandoc /usr/share/man/man3/malloc.3 and look at the nroff options to better fit your usage and avoid hyphenation.
I have a character array in EBCDIC format that I want to save to a file. I'm thinking of using fputs to output the character array without first converting it to another format.
Question: Is it legal to use fputs to write EBCDIC? If not, should I convert the string to ASCII before outputting it?
I've searched online, but couldn't find anything saying that fputs should not be used for outputting EBCDIC data.
If your character array in EBCDIC format is a C-style string, in that it ends with a \0 byte, then there is no problem.
fputs(), on a stream opened in binary mode, is encoding-agnostic, except that it does not write the terminating \0.
Assuming your program is written using the ASCII character set, it is important that the output file is opened in binary mode (e.g. "wb"). Otherwise C's \n will not match the EBCDIC newline, and text-mode translations may alter the data.
On the other hand, are you going to do anything with this file other than write it and maybe read it back?
Should your "character array that is in EBCDIC format" not end in \0, or have embedded \0 bytes, I suggest you simply use fwrite(). Again, be sure to open the file in binary mode, unless your entire system is EBCDIC.
Well, fputs takes a C string, and that uses the ASCII encoding. So, that won't work. I think you'll need to write the file using a lower-level function. Perhaps use fwrite to write the file directly without using strings. Here's the man page on fwrite.
Is there a way to feed non-ASCII bytes to a scanf that uses %s? I'm trying to insert arbitrary byte values like \x08\xDE\xAD and so on (to demonstrate a buffer overflow).
The input is not to a command line parameter, but to a scanf inside the program.
I assume you want to feed arbitrary data on stdin (since you read with scanf).
You can use the shell to create the data and pipe it into your program, e.g.
printf '\x08\xDE\xAD' | yourprogram
Note that this will only work as long as there are no white-space characters to be fed (because scanf with a %s format stops at white-space).
When you say 'to a scanf()', presumably there is other data than just this to be supplied. Would it work to have a program, perhaps a Perl or Python script, generate the data and write the non-ASCII characters to the standard input of your program? If you need standard input to appear like a terminal, then you should investigate expect which handles that for you. This is a common way of dealing with the problem.
I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好"; //line-1
wchar_t *wchar_ = L"中日友好"; //line-2
int main()
{
printf("ascii_: %s\n", ascii_); //line-3
wprintf(L"wchar_: %s\n", wchar_); //line-4
return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to a char* pointer in line-1, but I did, and the output of line-3 is correct. Why? How could printf() in line-3 print the non-ASCII characters? Does it know the encoding somehow?
I assume the code in line-2 and line-4 is correct, so why didn't I get any output from line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is compatible with ASCII.
Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream can only be set to either byte (normal) or wide orientation. Once set, it cannot be changed. It is set the first time the stream is used (which here is byte-oriented, due to the printf). After that, the wprintf will not work due to the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is being compiled so that the UTF-8 form of 中日友好 is stored into wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there are two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using unicode escape sequences. This will obviously make it unreadable in the source code, but it will work on any machine that can print Chinese character fonts regardless of the locale.
Line 1 is not ascii, it's whatever multibyte encoding is used by your compiler at compile-time. On modern systems that's probably UTF-8. printf does not know the encoding. It's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character io on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
You are omitting one step and therefore think the wrong way.
You have a C file on disk, containing bytes. You have a "ASCII" string and a wide string.
The ASCII string takes the bytes exactly like they are in line 1 and outputs them.
This works as long as the encoding of the user's side is the same as the one on the programmer's side.
For the wide string, the compiler first decodes the given bytes into Unicode code points, which are stored in the program; maybe this goes wrong on your side. On output they are encoded again according to the encoding on the user's side. This ensures that the characters are emitted as they were intended, not as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.