write produces strange results in files - C

I'm having strange results doing a simple open and write. I'll quote the program and then explain my results:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(){
    int fd = open("hello.txt", O_WRONLY|O_CREAT, 0700);
    char buf[] = "Hello world\n";
    int i = 0;
    for(i = 0; i < 10; i++){
        write(1, buf, sizeof(buf));
        write(fd, buf, sizeof(buf));
    }
    close(fd);
    return 0;
}
Using this code, the output in the terminal will be Hello world ten times as expected... But on the file hello.txt I get this:
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
效汬潷汲੤䠀汥潬眠牯摬
1) Why does this happen? What did I do wrong? And why Chinese?
Thanks in advance
Edit: Compiling using gcc 4.8.1 with -Wall flag: no warnings

You are writing 13 characters: sizeof(buf) includes the terminating 0.
Because you are sending a literal 0 to your terminal, it probably assumes your text is 'binary' (at least, that is what OS X's Terminal warns me about), and hence it attempts to convert the text to a likely encoding: 16-bit Unicode (UTF-16). This is 'likely' because, in Latin text, lots of characters have a 0 byte in their 16-bit code.
If you check the Unicode values of these Chinese characters, you will find
效 = U+6548
汬 = U+6C6C
潷 = U+6F77
汲 = U+6C72
which contain the hex codes of the 8-bit characters you wanted, paired up: 'H' (0x48) and 'e' (0x65) combine, little-endian, into U+6548, and so on. I suspect the space U+0020 is missing from this list because your terminal refuses to show "invalid" Unicode characters.
I forgot to add the obvious solution: write out one character less. Or, more obviously, write out strlen(buf) characters.
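In other words, the corrected program only changes the third argument of the write() calls:

#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void){
    int fd = open("hello.txt", O_WRONLY|O_CREAT, 0700);
    char buf[] = "Hello world\n";
    int i;
    for(i = 0; i < 10; i++){
        write(1, buf, strlen(buf));   // 12 bytes: the text only
        write(fd, buf, strlen(buf));  // the terminating '\0' is not written
    }
    close(fd);
    return 0;
}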

Related

fprintf(fp, "%c",10) not behaving as expected

Here's my code:
#include <stdio.h>
#include <stdlib.h>

int main(){
    FILE* fp = fopen("img.ppm","w");
    fprintf(fp,"%c", 10);
    fclose(fp);
    return 0;
}
For some reason that I am unable to uncover, this writes 2 bytes to the file: 0x0D 0x0A, while the behaviour I would expect is for it to just write 0x0A, which is 10 in decimal. It works fine with every other value between 0 and 255 inclusive: it writes just one byte to the file. I am completely lost; any help?
Assuming you are using the Windows C runtime library, newline characters are written as \r\n, that is, 13 10, which is 0x0D 0x0A. This is the only character that is actually written as two characters (by software compiled using the Windows toolchain).
You need to open the file with fopen("img.ppm","wb") to write binary.
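So the fixed version of the program reads:

#include <stdio.h>

int main(void){
    // "wb" opens the file in binary mode, so byte 10 (0x0A) is
    // written as-is instead of being expanded to 0x0D 0x0A.
    FILE* fp = fopen("img.ppm", "wb");
    fprintf(fp, "%c", 10);
    fclose(fp);
    return 0;
}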

fgetws can't read non-English characters on Linux

I have a basic C program that reads some lines from a text file containing hundreds of lines in its working directory. Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <ctype.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>
#include <unistd.h>

int main(int argc, const char * argv[]) {
    srand((unsigned)time(0));
    char *nameFileName = "MaleNames.txt";
    wchar_t line[100];
    wchar_t **nameLines = malloc(sizeof(wchar_t*) * 2000);
    int numNameLines = 0;
    FILE *nameFile = fopen(nameFileName, "r");
    while (fgetws(line, 100, nameFile) != NULL) {
        nameLines[numNameLines] = malloc(sizeof(wchar_t) * 100);
        wcsncpy(nameLines[numNameLines], line, 100);
        numNameLines++;
    }
    fclose(nameFile);
    wchar_t *name = nameLines[rand() % numNameLines];
    name[wcslen(name) - 1] = '\0';
    wprintf(L"%ls", name);
    int i;
    for (i = 0; i < numNameLines; i++) {
        free(nameLines[i]);
    }
    free(nameLines);
    return 0;
}
It basically reads my text file (its name is hard-coded, and the file exists in the working directory) line by line. The rest is irrelevant. It runs perfectly, as expected, on my Mac (with llvm/Xcode). When I try to compile (nothing fancy, again, gcc main.c) and run it on a Linux server, it either:
Exits with error code 2 (meaning no lines are read).
Reads only the first 3 lines from my file with hundreds of lines.
What causes this nondeterministic (and incorrect) behavior? I've tried commenting out the first line (the random seed) and compiling again; it always exits with return code 2.
What is the relation between the random functions and reading a file, and why am I getting this behavior?
UPDATE: I've fixed malloc to sizeof(wchar_t) * 100 from sizeof(wchar_t) * 50. It didn't change anything. My lines are about 15 characters at most, and there are much less than 2000 lines (it is guaranteed).
UPDATE 2:
I've compiled with -Wall, no issues.
I've compiled with -Werror, no issues.
I've run valgrind; it didn't find any leaks either.
I've debugged with gdb; it just doesn't enter the while loop (the fgetws call returns NULL).
UPDATE 3: I'm getting a floating point exception on Linux, as numNameLines is zero.
UPDATE 4: I verify that I have read permissions on MaleNames.txt.
UPDATE 5: I've found that accented, non-English characters (e.g. Â) cause problems while reading lines; fgetws halts on them. I've tried setting the locale (both setlocale(LC_ALL, "en.UTF-8"); and setlocale(LC_ALL, "tr.UTF-8"); separately), but it didn't work.
fgetws() attempts to read up to 100 wide characters, but the malloc() call in the loop (in the code as originally posted, before the update) allocated only 50.
The copy call then copies all the wide characters read. If more than 50 wide characters have been read (including the terminating nul), the copy overruns the allocated buffer. That results in undefined behaviour.
Instead of multiplying by 50 in the loop, multiply by 100. Or, better yet, compute the length of the string read and use that (see the sketch below).
Independently of the above, your code will also overrun a buffer if the file contains more than 2000 lines. Your loop needs to check for that.
A number of the functions in your code can fail, and will return a value to indicate that. Your code is not checking for any such failures.
Your code running under OS X is happenstance. The behaviour is undefined, which means there is potential to fail on any host system, when built with any compiler. Appearing to run correctly on one system, and failing on another system, is actually a valid set of responses to undefined behaviour.
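To make that concrete, here is a minimal sketch of those fixes (per-line allocation sized from wcslen(), a bound on the number of lines, and a checked fopen()); identifiers follow the question's code, and locale setup is omitted here (see the accepted fix below):

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

#define MAX_LINES 2000

int main(void) {
    wchar_t line[100];
    wchar_t *nameLines[MAX_LINES];
    int numNameLines = 0;

    FILE *nameFile = fopen("MaleNames.txt", "r");
    if (nameFile == NULL)
        return 2;   // report the failure instead of crashing later

    while (numNameLines < MAX_LINES &&
           fgetws(line, 100, nameFile) != NULL) {
        size_t len = wcslen(line) + 1;  // include the terminating nul
        nameLines[numNameLines] = malloc(len * sizeof(wchar_t));
        if (nameLines[numNameLines] == NULL)
            break;
        wmemcpy(nameLines[numNameLines], line, len);
        numNameLines++;
    }
    fclose(nameFile);

    wprintf(L"read %d lines\n", numNameLines);
    while (numNameLines > 0)
        free(nameLines[--numNameLines]);
    return 0;
}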
Found the solution. It was all about the locale, from the beginning. After experimenting and hours of research, I've stumbled upon this: http://cboard.cprogramming.com/c-programming/142780-arrays-accented-characters.html#post1066035
#include <locale.h>
setlocale(LC_ALL, "");
Setting the locale to the empty string (which selects the locale from the environment) solved my problem instantly.
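For reference, a minimal sketch of where that call belongs (it must run before any wide-character I/O; the file name is the one from the question):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    // Select the locale from the environment (e.g. en_US.UTF-8)
    // before any wide-character reads take place.
    setlocale(LC_ALL, "");

    FILE *nameFile = fopen("MaleNames.txt", "r");
    if (nameFile == NULL)
        return 1;

    wchar_t line[100];
    while (fgetws(line, 100, nameFile) != NULL)
        wprintf(L"%ls", line);

    fclose(nameFile);
    return 0;
}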

Reading and printing out Unicode and UTF-8 encodings in C

I'm using Windows XP.
I want to read from files in ASCII, UTF-8 and Unicode encodings and print out strings on stdout.
I was trying to use functions from wchar.h like fgetwc()/fputwc() and fgetws()/fputws(); they work on ASCII but not when a file is in UTF-8 or Unicode. Language-specific characters don't print, and when a file is in Unicode (UTF-16) nothing prints except a box and the first letter.
Is there any way of making a program in pure C that will read files, compare strings and print them out correctly on stdout regardless of the encoding of the files fed to the program?
Since you're on Windows, the key is that you want to write your strings out using the WriteConsoleW function, having first assembled the sequence of UTF-16 characters that you want to write out. (You probably should only write a few kilobytes of characters at a time.) Use GetStdHandle to obtain the console handle, of course.
Harder is determining the encoding of a file. Luckily, you don't need to distinguish between ASCII and UTF-8 as the latter is a strict superset of the former. But for any other single-byte encoding, you need to guess. Some UTF-8 files, more likely so on Windows than elsewhere, have a UTF-8 encoded byte-order mark at the beginning of the file; that's nasty as BOMs are not really supposed to be used with UTF-8, but a strong indicator if present. (Spotting UTF-16 is easier, as it should either have a byte-order mark, or you can guess it from the presence of NUL (0) bytes.)
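To illustrate the approach, here is a minimal sketch (not code from this answer: fixed-size buffers, no error checks, and the input is assumed to be UTF-8 without a BOM; the file name is hypothetical):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    static char utf8[4096];
    static wchar_t utf16[4096];

    // Read raw UTF-8 bytes; "rb" avoids newline translation.
    FILE *f = fopen("input.txt", "rb");
    size_t n = fread(utf8, 1, sizeof(utf8), f);
    fclose(f);

    // Convert the UTF-8 bytes to UTF-16...
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, (int)n, utf16, 4096);

    // ...and hand the UTF-16 directly to the console.
    DWORD written;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), utf16, wlen, &written, NULL);
    return 0;
}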
Here's a little piece of code I used to print various characters outside of the ASCII subset of Unicode (contains workarounds for what seems to be a bug in the Open Watcom compiler's implementation of printf()):
// Compile with Open Watcom C/C++ 1.9: wcl386 cons-utf8.c
#include <windows.h>
#include <stdio.h>
#include <stddef.h>

// Workarounds for printf() not printing multi-byte (UTF-8) strings
// with Open Watcom C/C++ 1.7-1.9.
// 0 - no workaround for printf()
// 1 - setbuf(stdout, NULL) before printf()
// 2 - fflush(stdout) after printf()
// 3 - WriteConsole() instead of printf()
#define PRINT_WORKAROUND 03

int main(void)
{
    DWORD err, i, j;
    // Code point ranges of characters to print
    static const DWORD ranges[][2] =
    {
        { 0x0A0, 0x0FF }, // Latin chars with diacritic marks + some others
        { 0x391, 0x3CE }, // Greek chars
        { 0x410, 0x44F }  // Cyrillic chars
    };

#if PRINT_WORKAROUND == 1
    setbuf(stdout, NULL);
#endif

    if (!SetConsoleOutputCP(CP_UTF8))
    {
        err = GetLastError();
        printf("SetConsoleOutputCP(CP_UTF8) failed with error 0x%X\n", err);
        goto Exit;
    }

    printf("Workaround: %d\n", PRINT_WORKAROUND);

    for (j = 0; j < sizeof(ranges) / sizeof(ranges[0]); j++)
    {
        for (i = ranges[j][0]; i <= ranges[j][1]; i++)
        {
            char str[8];
            int sz;
            wchar_t wstr[2];
            wstr[0] = i;
            wstr[1] = 0;
            sz = WideCharToMultiByte(CP_UTF8,
                                     0,
                                     wstr,
                                     -1,
                                     str,
                                     sizeof(str),
                                     NULL,
                                     NULL);
            if (sz <= 0)
            {
                err = GetLastError();
                printf("WideCharToMultiByte() failed with error 0x%X\n", err);
                goto Exit;
            }
#if PRINT_WORKAROUND < 3
            printf("%s", str);
#if PRINT_WORKAROUND == 2
            fflush(stdout);
#endif
#else
            WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE),
                         str,
                         sz - 1,
                         &err,
                         NULL);
#endif
        }
        printf("\n");
    }
    printf("\n");

Exit:
    return 0;
}
Output:
C:\>cons-utf8.exe
Workaround: 3
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя
I didn't find a way to print UTF-16 code points directly to the console in Windows XP that would work the same as above.
Is there any way of making a program in pure C that will read files, compare strings and print them out correctly on stdout regardless of the encoding of the files fed to the program?
No, the program obviously has to be told the encoding of the files too. Internally, you can choose to represent the data from the files as multibyte strings in UTF-8 or as wide strings.

C: lseek() related question

I want to write some bogus text in a file ("helloworld" text in a file called helloworld), but not starting from the beginning. I was thinking of using the lseek() function.
If I use the following code (edited):
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <stdlib.h>
#include <stdio.h>

#define fname "helloworld"
#define buf_size 16

int main(){
    char buffer[buf_size];
    int fildes, nbytes;
    off_t ret;

    fildes = open(fname, O_CREAT | O_TRUNC | O_WRONLY, S_IRUSR | S_IWUSR);
    if(fildes < 0){
        printf("\nCannot create file + trunc file.\n");
    }
    //modify offset
    if((ret = lseek(fildes, (off_t) 10, SEEK_END)) < (off_t) 0){
        fprintf(stdout, "\nCannot modify offset.\n");
    }
    printf("ret = %d\n", (int)ret);
    if(write(fildes, fname, 10) < 0){
        fprintf(stdout, "\nWrite failed.\n");
    }
    close(fildes);
    return (0);
}
it compiles cleanly and runs without any apparent errors.
Still, if I do:
cat helloworld
the output is not what I expected:
helloworld
Can
Where is "Can" coming from, and where are my empty spaces?
Should I expect zeros instead of spaces? If I try to open helloworld with gedit, an error occurs complaining that the file's character encoding is unknown.
LATER EDIT:
After I edited my program with the right buffer size for writing, and then compiled and ran it again, the helloworld file still cannot be opened with gedit.
LATER EDIT
I understand the issue now. I've added the following to the code:
char c;
fildes = open(fname, O_RDONLY);
if(fildes < 0){
    printf("\nCannot open file.\n");
}
while((nbytes = read(fildes, &c, 1)) == 1){
    printf("%d ", (int)c);
}
And now the output is:
0 0 0 0 0 0 0 0 0 0 104 101 108 108 111 119 111 114 108 100
My problem was that I was expecting spaces (32) instead of zeros (0).
In this function call, write(fildes, fname, buf_size), fname has 10 characters (plus a trailing '\0' character), but you're telling the function to write out 16 bytes. Who knows what is in the memory locations after the fname string.
Also, I'm not sure what you mean by "where are my empty spaces?".
Apart from expecting zeros to equal spaces, the original problem was indeed writing more than the length of the "helloworld" string. To avoid such a problem, I suggest letting the compiler calculate the length of your constant strings for you:
write(fildes, fname, sizeof(fname) - 1)
The - 1 is due to the NUL character (zero, \0) that is used to terminate C-style strings, and sizeof simply returning the size of the array that holds the string. Due to this you cannot use sizeof to calculate the actual length of a string at runtime, but it works fine for compile-time constants.
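A small illustration of that difference (not from the original post):

#include <stdio.h>
#include <string.h>

#define fname "helloworld"

int main(void){
    printf("sizeof(fname) = %zu\n", sizeof(fname)); // 11: ten chars + '\0', known at compile time
    printf("strlen(fname) = %zu\n", strlen(fname)); // 10: counted at runtime, up to the '\0'
    return 0;
}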
The "Can" you saw in your original test was almost certainly the beginning of one of the "\nCannot" strings in your code; after writing the 11 bytes in "helloworld\0" you continued to write the remaining bytes from whatever was following it in memory, which turned out to be the next string constant. (The question has now been amended to write 10 bytes, but the originally posted version wrote 16.)
The presence of NUL characters (= zero, '\0') in a text file may indeed cause certain (but not all) text editors to consider the file binary data instead of text, and possibly refuse to open it. A text file should contain just text, not control characters.
Your buf_size doesn't match the length of fname. It's reading past the buffer, and therefore getting more or less random bytes that just happened to sit after the string in memory.

C read binary stdin

I'm trying to build an instruction pipeline simulator and I'm having a lot of trouble getting started. What I need to do is read binary from stdin, and then store it in memory somehow while I manipulate the data. I need to read in chunks of exactly 32 bits one after the other.
How do I read in chunks of exactly 32 bits at a time? Secondly, how do I store it for manipulation later?
Here's what I've got so far, but on examining the binary chunks I've read, it just doesn't look right; I don't think I'm reading exactly 32 bits like I need.
char buffer[4] = { 0 }; // initialize to 0
unsigned long c = 0;
int bytesize = 4; // read in 32 bits
while (fgets(buffer, bytesize, stdin)) {
    memcpy(&c, buffer, bytesize); // copy the data to a more usable structure for bit manipulation later
    // more stuff
    buffer[0] = 0; buffer[1] = 0; buffer[2] = 0; buffer[3] = 0; // set to zero before next loop
}
fclose(stdin);
How do I read in 32 bits at a time (they are all 1s/0s, no newlines etc.), and what do I store them in? Is char[] okay?
EDIT: I'm able to read the binary in, but none of the answers produce the bits in the correct order; they are all mangled up. I suspect endianness issues and problems with reading and moving 8 bits (one char) around at a time. This needs to work on Windows, in C.
What you need is freopen(). From the manpage:
If filename is a null pointer, the freopen() function shall attempt to change the mode of the stream to that specified by mode, as if the name of the file currently associated with the stream had been used. In this case, the file descriptor associated with the stream need not be closed if the call to freopen() succeeds. It is implementation-defined which changes of mode are permitted (if any), and under what circumstances.
Basically, the best you can really do is this:
freopen(NULL, "rb", stdin);
This will reopen stdin to be the same input stream, but in binary mode. In the normal mode, reading from stdin on Windows will convert \r\n (Windows newline) to the single character ASCII 10. Using the "rb" mode disables this conversion so that you can properly read in binary data.
freopen() returns a file handle on success (NULL on failure), but it is the same stdin stream, so there is no need to keep it. After that, use fread() as has been mentioned.
As to your concerns, however, you may not be reading in "32 bits" but if you use fread() you will be reading in 4 chars (which is the best you can do in C - char is guaranteed to be at least 8 bits but some historical and embedded platforms have 16 bit chars (some even have 18 or worse)). If you use fgets() you will never read in 4 bytes. You will read in at least 3 (depending on whether any of them are newlines), and the 4th byte will be '\0' because C strings are nul-terminated and fgets() nul-terminates what it reads (like a good function). Obviously, this is not what you want, so you should use fread().
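Putting the two together, a minimal sketch of the read loop might look like this (byte order is a separate concern; see the end of this question):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    // Reopen stdin in binary mode so no newline translation occurs.
    freopen(NULL, "rb", stdin);

    uint32_t word;
    // fread() returns the number of complete 4-byte items read,
    // so the loop stops cleanly at end of input.
    while (fread(&word, sizeof(word), 1, stdin) == 1) {
        // 'word' now holds 32 bits in host byte order;
        // convert with ntohl() if the input is big-endian.
        printf("0x%08X\n", (unsigned)word);
    }
    return 0;
}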
Consider using the SET_BINARY_MODE macro and setmode():
#ifdef _WIN32
# include <io.h>
# include <fcntl.h>
# define SET_BINARY_MODE(handle) setmode(handle, O_BINARY)
#else
# define SET_BINARY_MODE(handle) ((void)0)
#endif
More details about SET_BINARY_MODE macro here: "Handling binary files via standard I/O"
More details about setmode here: "_setmode"
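Assuming the macro above is in scope, typical usage would be something like:

#include <stdio.h>

int main(void)
{
    // On Windows this switches fd 0 (stdin) to binary mode;
    // elsewhere the macro expands to a no-op.
    SET_BINARY_MODE(fileno(stdin));
    /* ... read binary data from stdin ... */
    return 0;
}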
I had to piece the answer together from the various comments from the kind people above, so here is a fully working sample. It is Windows-only, but you can probably translate the Windows-specific parts to your platform.
#include "stdafx.h"
#include "stdio.h"
#include "stdlib.h"
#include "windows.h"
#include <io.h>
#include <fcntl.h>
int main()
{
char rbuf[4096];
char *deffile = "c:\\temp\\outvideo.bin";
size_t r;
char *outfilename = deffile;
FILE *newin;
freopen(NULL, "rb", stdin);
_setmode(_fileno(stdin), _O_BINARY);
FILE *f = fopen(outfilename, "w+b");
if (f == NULL)
{
printf("unable to open %s\n", outfilename);
exit(1);
}
for (;; )
{
r = fread(rbuf, 1, sizeof(rbuf), stdin);
if (r > 0)
{
size_t w;
for (size_t nleft = r; nleft > 0; )
{
w = fwrite(rbuf, 1, nleft, f);
if (w == 0)
{
printf("error: unable to write %d bytes to %s\n", nleft, outfilename);
exit(1);
}
nleft -= w;
fflush(f);
}
}
else
{
Sleep(10); // wait for more input, but not in a tight loop
}
}
return 0;
}
For Windows, this Microsoft _setmode example specifically shows how to change stdin to binary mode:
// crt_setmode.c
// This program uses _setmode to change
// stdin from text mode to binary mode.
#include <stdio.h>
#include <fcntl.h>
#include <io.h>

int main( void )
{
    int result;

    // Set "stdin" to have binary mode:
    result = _setmode( _fileno( stdin ), _O_BINARY );
    if( result == -1 )
        perror( "Cannot set mode" );
    else
        printf( "'stdin' successfully changed to binary mode\n" );
}
fgets() is all wrong here. It's aimed at human-readable ASCII text terminated by end-of-line characters, not binary data, and won't get you what you need.
I recently did exactly what you want using the read() call. Unless your program has explicitly closed stdin, for the first argument (the file descriptor), you can use a constant value of 0 for stdin. Or, if you're on a POSIX system (Linux, Mac OS X, or some other modern variant of Unix), you can use STDIN_FILENO.
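A sketch of that read()-based approach (a minimal POSIX example; a robust version would also handle short reads on pipes):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
    uint32_t word;
    // Read 4 bytes at a time straight from fd 0 (STDIN_FILENO).
    // read() may return fewer bytes on a pipe; a robust version
    // would loop until the full 4 bytes arrive.
    while (read(STDIN_FILENO, &word, sizeof(word)) == (ssize_t)sizeof(word)) {
        printf("0x%08X\n", (unsigned)word);
    }
    return 0;
}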
fread() is the best fit for reading binary data.
Yes, a char array is OK if you are planning to process the data bytewise.
I don't know what OS you are running, but you typically cannot "open stdin in binary". You can try things like
int fd = fdreopen (fileno (stdin), outfname, O_RDONLY | OPEN_O_BINARY);
to try to force it. Then use
uint32_t opcode;
read(fd, &opcode, sizeof (opcode));
But I have not actually tried it myself. :)
I had it right the first time, except I needed ntohl(). See: C Endian Conversion : bit by bit
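For illustration, a toy demonstration of that conversion (ntohl() is declared in <arpa/inet.h> on POSIX systems, <winsock2.h> on Windows):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>  // ntohl(); on Windows include <winsock2.h> instead

int main(void)
{
    // The big-endian byte sequence 0x12 0x34 0x56 0x78, as it lands
    // in memory on a little-endian host:
    uint32_t raw = 0x78563412;
    uint32_t fixed = ntohl(raw);  // reorder from network to host byte order
    printf("raw = 0x%08X, fixed = 0x%08X\n", (unsigned)raw, (unsigned)fixed);
    return 0;
}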
