Reading and printing out Unicode and UTF-8 encodings in C

I'm using Windows XP.
I want to read from files in ASCII, UTF-8 and Unicode encodings and print out strings on stdout.
I was trying to use functions from wchar.h like fgetwc()/fputwc() and fgetws()/fputws(); they work on ASCII but not when a file is in UTF-8 or "Unicode" (UTF-16). Language-specific characters don't get printed, and when a file is in UTF-16 nothing is printed except a box and the first letter.
Is there any way of making a program in pure C that will read files, compare strings and print them out correctly on stdout regardless of the encoding of the files fed to the program?

Since you're on Windows, the key is that you want to write your strings out using the WriteConsoleW function, having first assembled the sequence of UTF-16 characters that you want to write out. (You probably should only write a few kilobytes of characters at a time.) Use GetStdHandle to obtain the console handle, of course.
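For instance, a minimal sketch of that call sequence might look like this (print_utf16 and its parameters are illustrative names of mine, not a Windows API):

#include <windows.h>

/* Illustrative helper: write a UTF-16 buffer to the console. */
static void print_utf16(const wchar_t *wbuf, DWORD nchars)
{
    DWORD written;
    HANDLE con = GetStdHandle(STD_OUTPUT_HANDLE);
    /* The third argument counts UTF-16 code units, not bytes. */
    WriteConsoleW(con, wbuf, nchars, &written, NULL);
}

int main(void)
{
    print_utf16(L"\x043F\x0440\x0438\x0432\x0435\x0442\n", 7); /* "привет" */
    return 0;
}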
Harder is determining the encoding of a file. Luckily, you don't need to distinguish between ASCII and UTF-8 as the latter is a strict superset of the former. But for any other single-byte encoding, you need to guess. Some UTF-8 files, more likely so on Windows than elsewhere, have a UTF-8 encoded byte-order mark at the beginning of the file; that's nasty as BOMs are not really supposed to be used with UTF-8, but a strong indicator if present. (Spotting UTF-16 is easier, as it should either have a byte-order mark, or you can guess it from the presence of NUL (0) bytes.)
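For illustration, a BOM check along those lines could look like this (detect_bom is a made-up helper, "input.txt" a placeholder, and the final fallback is still only a guess):

#include <stdio.h>

/* Made-up helper: peek at the first bytes of the file and rewind. */
static const char *detect_bom(FILE *f)
{
    unsigned char b[3] = {0};
    size_t n = fread(b, 1, 3, f);
    rewind(f);
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8 (with BOM)";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    return "no BOM: ASCII, UTF-8 or some single-byte encoding";
}

int main(void)
{
    FILE *f = fopen("input.txt", "rb");
    if (f) {
        printf("%s\n", detect_bom(f));
        fclose(f);
    }
    return 0;
}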

Here's a little piece of code I used to print various characters outside of the ASCII subset of Unicode (contains workarounds for what seems to be a bug in the Open Watcom compiler's implementation of printf()):
// Compile with Open Watcom C/C++ 1.9: wcl386 cons-utf8.c
#include <windows.h>
#include <stdio.h>
#include <stddef.h>

// Workarounds for printf() not printing multi-byte (UTF-8) strings
// with Open Watcom C/C++ 1.7-1.9.
// 0 - no workaround for printf()
// 1 - setbuf(stdout, NULL) before printf()
// 2 - fflush(stdout) after printf()
// 3 - WriteConsole() instead of printf()
#define PRINT_WORKAROUND 3

int main(void)
{
    DWORD err, i, j;
    // Code point ranges of characters to print
    static const DWORD ranges[][2] =
    {
        { 0x0A0, 0x0FF }, // Latin chars with diacritic marks + some others
        { 0x391, 0x3CE }, // Greek chars
        { 0x410, 0x44F }  // Cyrillic chars
    };

#if PRINT_WORKAROUND == 1
    setbuf(stdout, NULL);
#endif

    if (!SetConsoleOutputCP(CP_UTF8))
    {
        err = GetLastError();
        printf("SetConsoleOutputCP(CP_UTF8) failed with error 0x%X\n", err);
        goto Exit;
    }

    printf("Workaround: %d\n", PRINT_WORKAROUND);

    for (j = 0; j < sizeof(ranges) / sizeof(ranges[0]); j++)
    {
        for (i = ranges[j][0]; i <= ranges[j][1]; i++)
        {
            char str[8];
            int sz;
            wchar_t wstr[2];
            wstr[0] = i;
            wstr[1] = 0;
            sz = WideCharToMultiByte(CP_UTF8,
                                     0,
                                     wstr,
                                     -1,
                                     str,
                                     sizeof(str),
                                     NULL,
                                     NULL);
            if (sz <= 0)
            {
                err = GetLastError();
                printf("WideCharToMultiByte() failed with error 0x%X\n", err);
                goto Exit;
            }
#if PRINT_WORKAROUND < 3
            printf("%s", str);
#if PRINT_WORKAROUND == 2
            fflush(stdout);
#endif
#else
            WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE),
                         str,
                         sz - 1,
                         &err,
                         NULL);
#endif
        }
        printf("\n");
    }

    printf("\n");

Exit:
    return 0;
}
Output:
C:\>cons-utf8.exe
Workaround: 3
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя
I didn't find a way to print UTF-16 code points directly to the console in Windows XP that would work the same as above.

Is there any way of making a program in pure C that will read files, compare strings and print them out correctly on stdout regardless of the encoding of the files fed to the program?
No, the program obviously has to be told the encoding of the files too. Internally, you can choose to represent the file data as multibyte strings in UTF-8 or as wide strings.

Related

C: Cannot write Unicode braille to UTF-8 doc when compiling for Windows

I have some code that works perfectly fine on Linux, but on Windows it only works as expected if it is compiled using Cygwin, which emulates a Linux environment on Windows but is bad for portability (you must have Cygwin installed for the compiled binary to work). The program does the following:
Opens a document in read mode and ccs=UTF-8 and reads it char by char.
Writes the braille Unicode pattern (U+2800..U+28FF) corresponding to that letter, num. or punct. mark to a 'dest' document (opened in write mode and ccs=UTF-8)
Significant code:
const char *brai[26] = {
    "⠁","⠃","⠉","⠙","⠑","⠋","⠛","⠓","⠊","⠚",
    "⠅","⠇","⠍","⠝","⠕","⠏","⠟","⠗","⠎","⠞",
    "⠥","⠧","⠭","⠽","⠵","⠺"
};

int main(void) {
    setlocale(LC_ALL, "es_MX.UTF-8");
    FILE *source = fopen(origen, "r, ccs=UTF-8");
    FILE *dest = fopen(destino, "w, ccs=UTF-8");
    unsigned int letra;
    while ((letra = fgetc(source)) != EOF) {
        // This next line is the problem, I guess:
        fwprintf(dest, L"%s", "⠷"); // Prints directly the braille sign as a char[]
        // OR prints it from an array that contains the exact same sign.
        fwprintf(dest, L"%s", brai[7]);
    }
}
The code works as expected on Linux every time, but not on Windows. I tried everything and nothing seems to get the output right. In the 'dest' document I get random chars like:
甥╩極肠─猀甥iꃢ¨.
The only way to print braille patterns to the doc so far on Windows was:
fwprintf(dest, L"⠷");
Which is not very useful (would need to make an 'else if' for every case instead).
If you wish to see the full code, it's on Github:
https://github.com/oliver-almaraz/Texto_a_Braille
What I tried so far:
Changing the file open options to UTF-16LE and UNICODE.
Changing the fwprintf() arguments in every way I could imagine.
Changing the type of the arrays containing the braille patterns to unsigned int.
Different compilers.
Here's a tested (with MSVC and mingw on Windows), semi-working example.
#include <stdio.h>
#include <ctype.h>

const char *brai[26] = {
    "⠁","⠃","⠉","⠙","⠑","⠋","⠛","⠓","⠊","⠚",
    "⠅","⠇","⠍","⠝","⠕","⠏","⠟","⠗","⠎","⠞",
    "⠥","⠧","⠭","⠽","⠵","⠺"
};

int main(void) {
    char *origen = "a.txt";
    char *destino = "b.txt";
    FILE *source = fopen(origen, "r");
    FILE *dest = fopen(destino, "w");
    int letra;
    while ((letra = fgetc(source)) != EOF) {
        if (isupper(letra))
            fprintf(dest, "%s", brai[letra - 'A']);
        else if (islower(letra))
            fprintf(dest, "%s", brai[letra - 'a']);
        else
            fprintf(dest, "%c", letra);
    }
}
Note these things.
No locale or wide character or anything like that in sight. None of this is needed.
This code only translates English letters. No punctuation or numbers (I don't know nearly enough about Braille to add that, but this should be straightforward).
Since the code only translates English letters and leaves everything else as is, it is OK to feed it a UTF-8 encoded file. It will just leave unrecognised characters untranslated. If you ever need to translate accented letters, you will need to learn a whole lot more about Unicode. Here is a good place to start.
Error handling omitted for brevity.
The source file must be saved in the correct charset. For MSVC, use either UTF-8 with BOM or UTF-16; alternatively, use UTF-8 without BOM together with the /utf-8 compiler switch if your MSVC version recognises it. For mingw, just use UTF-8.
This method will not work for standard console output on Windows. It is not a big problem since Windows console by default won't output Braille characters anyway. It will however work for msys console and many others.
Option 1: Use wchar_t and fwprintf. Make sure to save the source as UTF-8 w/ BOM encoding or use UTF-8 encoding and the /utf-8 switch to force assuming UTF-8 encoding on the Microsoft compiler; otherwise, MSVS assumes an ANSI encoding for the source file and you get mojibake.
#include <stdio.h>

const wchar_t brai[] = L"⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺";

int main(void) {
    FILE *dest = fopen("out.txt", "w, ccs=UTF-8");
    fwprintf(dest, L"%s", brai);
}
out.txt (encoded as UTF-8 w/ BOM):
⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺
Option 2: Use char and fprintf, save the source as UTF-8 or UTF-8 w/ BOM, and use the /utf-8 Microsoft compile switch. The char string will be in the source encoding, so it must be UTF-8 to get UTF-8 in the output file.
#include <stdio.h>

const char brai[] = "⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺";

int main(void) {
    FILE *dest = fopen("out.csv", "w");
    fprintf(dest, "%s", brai);
}
The latest compiler can also use the u8"" syntax. The advantage here is you can use a different source encoding and the char string will still be UTF-8 as long as you use the appropriate compiler switch to indicate the source encoding.
const char brai[] = u8"⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺";
For reference, these are the Microsoft compiler options:
/source-charset:<iana-name>|.nnnn set source character set
/execution-charset:<iana-name>|.nnnn set execution character set
/utf-8 set source and execution character set to UTF-8
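So a typical invocation for the UTF-8 examples above might look like this (an illustrative command line; braille.c stands for whatever the source file is called):

cl /utf-8 braille.c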

How is a multibyte string converted to a wide-character string in fxprintf.c in glibc?

Currently, the logic in the glibc source of perror() is as follows:
If stderr is oriented, use it as is; otherwise dup() it and use perror() on the dup()'ed fd.
If stderr is wide-oriented, the following logic from stdio-common/fxprintf.c is used:
size_t len = strlen (fmt) + 1;
wchar_t wfmt[len];
for (size_t i = 0; i < len; ++i)
{
    assert (isascii (fmt[i]));
    wfmt[i] = fmt[i];
}
res = __vfwprintf (fp, wfmt, ap);
The format string is converted to wide-character form by the following code, which I do not understand:
wfmt[i] = fmt[i];
Also, it uses isascii assert:
assert (isascii(fmt[i]));
But the format string is not always ASCII in wide-character programs, because we may use a UTF-8 format string, which can contain non-7-bit values.
Why is there no assert failure when we run the following code (assuming a UTF-8 locale and UTF-8 compiler encoding)?
#include <stdio.h>
#include <errno.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");
    fwide(stderr, 1);
    errno = EINVAL;
    perror("привет мир"); /* note that the string is multibyte */
    return 0;
}
$ ./a.out
привет мир: Invalid argument
Can we use dup() on wide-oriented stderr to make it not wide-oriented? In that case the code could be rewritten without this mysterious conversion, given that perror() takes only multibyte strings (const char *s) and locale messages are all multibyte anyway.
Turns out we can. The following code demonstrates this:
#include <stdio.h>
#include <wchar.h>
#include <unistd.h>

int main(void)
{
    fwide(stdout, 1);
    FILE *fp;
    int fd = -1;
    if ((fd = fileno (stdout)) == -1) return 1;
    if ((fd = dup (fd)) == -1) return 1;
    if ((fp = fdopen (fd, "w+")) == NULL) return 1;
    wprintf(L"stdout: %d, dup: %d\n", fwide(stdout, 0), fwide(fp, 0));
    return 0;
}
$ ./a.out
stdout: 1, dup: 0
BTW, is it worth posting an issue about this improvement to glibc developers?
NOTE
Using dup() is limited with respect to buffering; I wonder if that is taken into account in the implementation of perror() in glibc. The following example demonstrates the issue.
Output appears not in the order of writing to the streams, but in the order in which the buffered data is flushed.
Note that the order of values in the output is not the same as in the program: the output of fprintf() is flushed first (because of the "\n"), while the output of fwprintf() is flushed only when the program exits.
#include <wchar.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    wint_t wc = L'b';
    fwprintf(stdout, L"%lc", wc);

    /* --- */

    FILE *fp;
    int fd = -1;
    if ((fd = fileno (stdout)) == -1) return 1;
    if ((fd = dup (fd)) == -1) return 1;
    if ((fp = fdopen (fd, "w+")) == NULL) return 1;
    char c = 'h';
    fprintf(fp, "%c\n", c);
    return 0;
}
$ ./a.out
h
b
But if we use \n in the fwprintf() call, the output order matches the program:
$ ./a.out
b
h
perror() manages to get away with that, because in GNU libc stderr is unbuffered. But will it work safely in programs where stderr is manually set to buffered mode?
This is the patch that I would propose to glibc developers:
diff -urN glibc-2.24.orig/stdio-common/perror.c glibc-2.24/stdio-common/perror.c
--- glibc-2.24.orig/stdio-common/perror.c 2016-08-02 09:01:36.000000000 +0700
+++ glibc-2.24/stdio-common/perror.c 2016-10-10 16:46:03.814756394 +0700
@@ -36,7 +36,7 @@
   errstring = __strerror_r (errnum, buf, sizeof buf);

-  (void) __fxprintf (fp, "%s%s%s\n", s, colon, errstring);
+  (void) _IO_fprintf (fp, "%s%s%s\n", s, colon, errstring);
 }
@@ -55,7 +55,7 @@
      of the stream. What is supposed to happen when the stream isn't
      oriented yet? In this case we'll create a new stream which is
      using the same underlying file descriptor. */
-  if (__builtin_expect (_IO_fwide (stderr, 0) != 0, 1)
+  if (__builtin_expect (_IO_fwide (stderr, 0) < 0, 1)
       || (fd = __fileno (stderr)) == -1
       || (fd = __dup (fd)) == -1
       || (fp = fdopen (fd, "w+")) == NULL)
NOTE: It wasn't easy to find concrete questions in this post; on the whole, the post seems to be an attempt to engage in a discussion about implementation details of glibc, which it seems to me would be better directed to a forum specifically oriented to development of that library such as the libc-alpha mailing list. (Or see https://www.gnu.org/software/libc/development.html for other options.) This sort of discussion is not really a good match for StackOverflow, IMHO. Nonetheless, I tried to answer the questions I could find.
How does wfmt[i] = fmt[i]; convert from multibyte to wide character?
Actually, the code is:
assert(isascii(fmt[i]));
wfmt[i] = fmt[i];
which is based on the fact that the numeric value of an ASCII character is the same as its wchar_t value. Strictly speaking, this need not be the case. The C standard specifies:
Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define __STDC_MB_MIGHT_NEQ_WC__. (§7.19/2)
(gcc does not define that symbol.)
However, that only applies to characters in the basic set, not to all characters recognized by isascii. The basic character set contains the 91 printable ascii characters as well as space, newline, horizontal tab, vertical tab and form feed. So it is theoretically possible that one of the remaining control characters will not be correctly converted. However, the actual format string used in the call to __fxprintf only contains characters from the basic character set, so in practice this pedantic detail is not important.
Why is there no assert failure when we execute perror("привет мир");?
Because only the format string is being converted, and the format string (which is "%s%s%s\n") contains only ascii characters. Since the format string contains %s (and not %ls), the argument is expected to be char* (and not wchar_t*) in both the narrow- and wide-character orientations.
Can we use dup() on wide-oriented stderr to make it not wide-oriented?
That would not be a good idea. First, if the stream has an orientation, it might also have a non-empty internal buffer. Since that buffer is part of the stdio library and not of the underlying Posix fd, it will not be shared with the duplicate fd. So the message printed by perror might be interpolated in the middle of some existing output. In addition, it is possible that the multibyte encoding has shift states, and that the output stream is not currently in the initial shift state. In that case, outputting an ascii sequence could result in garbled output.
In the actual implementation, the dup is only performed on streams without orientation; these streams have never had any output directed at them, so they are definitely still in the initial shift state with an empty buffer (if the stream is buffered).
Is it worth posting an issue about this improvement to glibc developers?
That is up to you, but don't do it here. The normal way of doing that would be to file a bug. There is no reason to believe that glibc developers read SO questions, and even if they do, someone would have to copy the issue to a bug, and also copy any proposed patch.
it uses isascii assert.
This is OK. You are not supposed to call this function. It is a glibc internal. Note the two underscores in front of the name. When called from perror, the argument in question is "%s%s%s\n", which is entirely ASCII.
But the format string is not always ascii in wide-character programs, because we may use UTF-8
First, UTF-8 has nothing to do with wide characters. Second, the format string is always ASCII because the function is only called by other glibc functions that know what they are doing.
perror("привет мир");
This is not the format string, this is one of the arguments that corresponds to one of the %s in the actual format string.
Can we use dup() on wide-oriented stderr
You cannot use dup on a FILE*; it operates on POSIX file descriptors, which don't have orientation.
This is the patch that I would propose to glibc developers:
Why? What isn't working?

How to detect the character encoding of command line arguments in mingw

Is it safe to assume they are ISO-8859-15 (Windows-1252?), or is there some function I can call to query this? The end goal is conversion to UTF-8.
Background:
The problem described by this question arises because XMLStarlet assumes its command-line arguments are UTF-8. Under Windows it seems they are actually ISO-8859-15 (Windows-1252?), or at least adding the following to the beginning of main() makes things work:
char **utf8argv = malloc(sizeof(char*) * (argc+1));
utf8argv[argc] = NULL;
{
    iconv_t windows2utf8 = iconv_open("UTF-8", "ISO-8859-15");
    int i;
    for (i = 0; i < argc; i++) {
        const char *arg = argv[i];
        size_t len = strlen(arg);
        size_t outlen = len*2 + 1;
        char *utfarg = malloc(outlen);
        char *out = utfarg;
        size_t ret = iconv(windows2utf8,
                           (char **)&arg, &len,
                           &out, &outlen);
        if (ret == (size_t)-1) {
            perror("iconv");
            utf8argv[i] = NULL;
            continue;
        }
        out[0] = '\0';
        utf8argv[i] = utfarg;
    }
    argv = utf8argv;
}
Testing Encoding
The following program prints out the bytes of its first argument in decimal:
#include <string.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    for (int i = 0; i < strlen(argv[1]); i++) {
        printf("%d ", (unsigned char) argv[1][i]);
    }
    printf("\n");
    return 0;
}
chcp reports code page 850, so the characters æ and Æ should be 145 and 146, respectively.
C:\Users\npostavs\tmp>chcp
Active code page: 850
But we see 230 and 198 reported which matches 1252:
C:\Users\npostavs\tmp>cmd-chars æÆ
230 198
Passing characters outside of codepage causes lossy transformation
Making a shortcut to cmd-chars.exe with arguments αβγ (these are not present in codepage 1252) gives
C:\Users\npostavs\tmp>shortcut-cmd-chars.lnk
97 223 63
Which is aß?.
You can call CommandLineToArgvW with a call to GetCommandLineW as the first argument to get the command-line arguments in an argv-style array of wide strings. This is the only portable Windows way, especially with the code page mess; Japanese characters can be passed via a Windows shortcut for example. After that, you can use WideCharToMultiByte with a code page argument of CP_UTF8 to convert each wide-character argv element to UTF-8.
Note that calling WideCharToMultiByte with an output buffer size (byte count) of 0 will allow you to determine the number of UTF-8 bytes required for the number of characters specified (or the entire wide string including the null terminator if you wish to pass -1 as the number of wide characters to simplify your code). Then you can allocate the required number of bytes using malloc et al. and call WideCharToMultiByte again with the correct number of bytes instead of 0. If this was performance-critical, a different solution would probably be best, but since this is a one-time function to get command-line arguments, I'd say any decrease in performance would be negligible.
Of course, don't forget to free all of your memory, including calling LocalFree with the pointer returned by CommandLineToArgvW as the argument.
For more info on the functions and how you can use them, click the links to see the MSDN documentation.
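A rough sketch of that whole sequence (error handling trimmed; this illustrates the calls named above, not verbatim code from MSDN):

#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW; link with Shell32.lib */
#include <stdlib.h>

int main(void)
{
    int argcw = 0;
    wchar_t **argvw = CommandLineToArgvW(GetCommandLineW(), &argcw);
    if (argvw == NULL)
        return 1;
    for (int i = 0; i < argcw; i++) {
        /* First call with cbMultiByte = 0: ask how many UTF-8 bytes
           are needed (passing -1 includes the null terminator). */
        int need = WideCharToMultiByte(CP_UTF8, 0, argvw[i], -1,
                                       NULL, 0, NULL, NULL);
        char *utf8 = malloc(need);
        /* Second call: perform the actual conversion. */
        WideCharToMultiByte(CP_UTF8, 0, argvw[i], -1,
                            utf8, need, NULL, NULL);
        /* ... use utf8 ... */
        free(utf8);
    }
    LocalFree(argvw);
    return 0;
}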
The command-line parameters are in the system default codepage, which varies depending on system settings. Rather than specify a specific source charset at all, you can specify "char" or "" instead and let iconv_open() figure out what the system charset actually is:
iconv_t windows2utf8 = iconv_open("UTF-8", "char");
Otherwise, you are better off retrieving the command line as UTF-16 instead of as ANSI, and then you can convert it directly to UTF-8 using iconv_open("UTF-8", "UTF-16LE"), or WideCharToMultiByte(CP_UTF8) as Chrono suggested.
It seems that you are on Windows.
In that case, you can make a system() call to run the CHCP command.
#include <stdlib.h> // Uses: system()
#include <stdio.h>

// .....

// 1st: Store the present Windows codepage in a text file:
system("CMD /C \"CHCP > myenc.txt\"");

// 2nd: Read the first line of the file:
FILE *F = fopen("myenc.txt", "r");
char buffer[100];
fgets(buffer, sizeof(buffer), F);
fclose(F);

// 3rd: Analyze the loaded string to find the Windows codepage:
int codepage = my_CHCP_analizer_func(buffer);

// The function my_CHCP_analizer_func() must be written by you,
// and it has to take into account the way CHCP prints the information.
Finally, the codepages sent by CHCP can be checked for example here:
Windows Codepages
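For illustration only, one possible body for that placeholder simply takes the last run of digits on the CHCP line (this sketch assumes English CHCP output such as "Active code page: 850"; localized output is not handled):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the placeholder: return the last number on the line. */
int my_CHCP_analizer_func(const char *line)
{
    int cp = 0;
    const char *p = line;
    while (*p) {
        if (isdigit((unsigned char)*p))
            cp = (int)strtol(p, (char **)&p, 10);
        else
            p++;
    }
    return cp;
}

int main(void)
{
    printf("%d\n", my_CHCP_analizer_func("Active code page: 850")); /* 850 */
    return 0;
}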

write produces strange results in files

I'm having strange results doing a simple open and write. I'll quote the program and then I'll explain my results:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(){
    int fd = open("hello.txt", O_WRONLY|O_CREAT, 0700);
    char buf[] = "Hello world\n";
    int i = 0;
    for(i = 0; i < 10; i++){
        write(1, buf, sizeof(buf));
        write(fd, buf, sizeof(buf));
    }
    close(fd);
    return 0;
}
Using this code, the output in the terminal will be Hello world ten times as expected... But on the file hello.txt I get this:
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
1) Why does this happen? What did I do wrong? And why Chinese?
Thanks in advance
Edit: Compiling using gcc 4.8.1 with -Wall flag: no warnings
You are writing 13 characters (sizeof(buf)), which includes the terminating 0.
Because you are sending a literal 0 to your terminal, it probably assumes your text is 'binary' (at least, that is what OS X's Terminal warns me about), and hence it attempts to convert the text to a likely encoding: 16-bit Unicode (UTF-16). This is 'likely' because, in Latin text, lots of characters have a 0 in their 16-bit code.
If you check the Unicode values of these Chinese characters, you will find
效 = U+6548
汬 = U+6C6C
潷 = U+6F77
汲 = U+6C72
which seem to contain the hex codes for the 8-bit characters you wanted. I suspect the space U+0020 is missing in this list because your terminal refuses to show "invalid" Unicode characters.
Forgot to add the obvious solution: write out one character less. Or, more obviously, write out strlen(buf) characters.
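For reference, a corrected version of the question's program (same code, but writing strlen(buf) bytes):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("hello.txt", O_WRONLY | O_CREAT, 0700);
    char buf[] = "Hello world\n";
    for (int i = 0; i < 10; i++) {
        /* strlen(buf) excludes the terminating '\0', so no stray
           NUL bytes reach the terminal or the file. */
        write(1, buf, strlen(buf));
        write(fd, buf, strlen(buf));
    }
    close(fd);
    return 0;
}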

C read binary stdin

I'm trying to build an instruction pipeline simulator and I'm having a lot of trouble getting started. What I need to do is read binary from stdin, and then store it in memory somehow while I manipulate the data. I need to read in chunks of exactly 32 bits one after the other.
How do I read in chunks of exactly 32 bits at a time? Secondly, how do I store it for manipulation later?
Here's what I've got so far, but when I examine the binary chunks I've read, it just doesn't look right; I don't think I'm reading exactly 32 bits like I need.
char buffer[4] = { 0 }; // initialize to 0
unsigned long c = 0;
int bytesize = 4; // read in 32 bits
while (fgets(buffer, bytesize, stdin)) {
    memcpy(&c, buffer, bytesize); // copy the data to a more usable structure for bit manipulation later
    // more stuff
    buffer[0] = 0; buffer[1] = 0; buffer[2] = 0; buffer[3] = 0; // set to zero before next loop
}
fclose(stdin);
How do I read in 32 bits at a time (they are all 1/0, no newlines etc.), and what do I store them in? Is char[] okay?
EDIT: I'm able to read the binary in, but none of the answers produce the bits in the correct order; they are all mangled up. I suspect endianness and problems with reading and moving 8 bits (1 char) around at a time. This needs to work on Windows and in C ... ?
What you need is freopen(). From the manpage:
If filename is a null pointer, the freopen() function shall attempt to change the mode of the stream to that specified by mode, as if the name of the file currently associated with the stream had been used. In this case, the file descriptor associated with the stream need not be closed if the call to freopen() succeeds. It is implementation-defined which changes of mode are permitted (if any), and under what circumstances.
Basically, the best you can really do is this:
freopen(NULL, "rb", stdin);
This will reopen stdin to be the same input stream, but in binary mode. In the normal mode, reading from stdin on Windows will convert \r\n (Windows newline) to the single character ASCII 10. Using the "rb" mode disables this conversion so that you can properly read in binary data.
freopen() returns a filehandle, but it's the previous value (before we put it in binary mode), so don't use it for anything. After that, use fread() as has been mentioned.
As to your concerns, however, you may not be reading in "32 bits" but if you use fread() you will be reading in 4 chars (which is the best you can do in C - char is guaranteed to be at least 8 bits but some historical and embedded platforms have 16 bit chars (some even have 18 or worse)). If you use fgets() you will never read in 4 bytes. You will read in at least 3 (depending on whether any of them are newlines), and the 4th byte will be '\0' because C strings are nul-terminated and fgets() nul-terminates what it reads (like a good function). Obviously, this is not what you want, so you should use fread().
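Putting freopen() and fread() together, a minimal sketch (error checks omitted, and assuming a 4-byte uint32_t is the chunk you want):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t word;

    /* Reopen stdin in binary mode so Windows does not translate
       \r\n or treat Ctrl-Z as end of file. */
    freopen(NULL, "rb", stdin);

    /* fread() returns the number of complete 4-byte items read,
       so the loop stops cleanly at EOF or on a short final chunk. */
    while (fread(&word, sizeof word, 1, stdin) == 1) {
        /* ... process word; mind the host byte order ... */
    }
    return 0;
}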
Consider using the SET_BINARY_MODE macro and setmode():
#ifdef _WIN32
# include <io.h>
# include <fcntl.h>
# define SET_BINARY_MODE(handle) setmode(handle, O_BINARY)
#else
# define SET_BINARY_MODE(handle) ((void)0)
#endif
More details about SET_BINARY_MODE macro here: "Handling binary files via standard I/O"
More details about setmode here: "_setmode"
I had to piece the answer together from the various comments from the kind people above, so here is a fully working sample. It is Windows-only, but you can probably translate the Windows-specific stuff to your platform.
#include "stdafx.h"
#include "stdio.h"
#include "stdlib.h"
#include "windows.h"
#include <io.h>
#include <fcntl.h>
int main()
{
char rbuf[4096];
char *deffile = "c:\\temp\\outvideo.bin";
size_t r;
char *outfilename = deffile;
FILE *newin;
freopen(NULL, "rb", stdin);
_setmode(_fileno(stdin), _O_BINARY);
FILE *f = fopen(outfilename, "w+b");
if (f == NULL)
{
printf("unable to open %s\n", outfilename);
exit(1);
}
for (;; )
{
r = fread(rbuf, 1, sizeof(rbuf), stdin);
if (r > 0)
{
size_t w;
for (size_t nleft = r; nleft > 0; )
{
w = fwrite(rbuf, 1, nleft, f);
if (w == 0)
{
printf("error: unable to write %d bytes to %s\n", nleft, outfilename);
exit(1);
}
nleft -= w;
fflush(f);
}
}
else
{
Sleep(10); // wait for more input, but not in a tight loop
}
}
return 0;
}
For Windows, this Microsoft _setmode example specifically shows how to change stdin to binary mode:
// crt_setmode.c
// This program uses _setmode to change
// stdin from text mode to binary mode.
#include <stdio.h>
#include <fcntl.h>
#include <io.h>

int main( void )
{
    int result;

    // Set "stdin" to have binary mode:
    result = _setmode( _fileno( stdin ), _O_BINARY );
    if( result == -1 )
        perror( "Cannot set mode" );
    else
        printf( "'stdin' successfully changed to binary mode\n" );
}
fgets() is all wrong here. It's aimed at human-readable ASCII text terminated by end-of-line characters, not binary data, and won't get you what you need.
I recently did exactly what you want using the read() call. Unless your program has explicitly closed stdin, for the first argument (the file descriptor), you can use a constant value of 0 for stdin. Or, if you're on a POSIX system (Linux, Mac OS X, or some other modern variant of Unix), you can use STDIN_FILENO.
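For illustration, a sketch of that approach (assuming stdin carries raw binary; a robust version would loop on short reads, since read() may legally return fewer bytes than requested):

#include <stdint.h>
#include <unistd.h>

int main(void)
{
    uint32_t word;

    /* Sketch only: assumes each read() delivers a full 4-byte word. */
    while (read(STDIN_FILENO, &word, sizeof word) == (ssize_t)sizeof word) {
        /* ... process word ... */
    }
    return 0;
}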
fread() suits best for reading binary data.
Yes, a char array is OK if you plan to process it bytewise.
I don't know what OS you are running, but you typically cannot "open stdin in binary". You can try things like
int fd = fdreopen (fileno (stdin), outfname, O_RDONLY | OPEN_O_BINARY);
to try to force it. Then use
uint32_t opcode;
read(fd, &opcode, sizeof (opcode));
But I have not actually tried it myself. :)
I had it right the first time, except I needed ntohl ... C Endian Conversion : bit by bit
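For reference, a sketch of that combination (fread() plus ntohl(), so each 4-byte chunk is interpreted as big-endian regardless of the host; on Windows, combine with the binary-mode setup above):

#include <stdio.h>
#include <stdint.h>
#ifdef _WIN32
#include <winsock2.h>   /* ntohl(); link with Ws2_32.lib */
#else
#include <arpa/inet.h>  /* ntohl() */
#endif

int main(void)
{
    uint32_t raw;
    while (fread(&raw, sizeof raw, 1, stdin) == 1) {
        /* Convert from big-endian ("network order") to host order. */
        uint32_t opcode = ntohl(raw);
        (void)opcode; /* ... decode opcode ... */
    }
    return 0;
}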
