Reading text file with non-english character in C

Is it possible to read a text file that has non-English text?
Example of text in file:
E 37
SVAR:
Fettembolisyndrom. (1 poäng)
Example of what the buffer filled by fread contains when printed with puts:
E 37 SVAR:
Fettembolisyndrom.
(1 po├ñng)
Under Linux my program works fine, but on Windows I see this problem with non-English letters. Any advice on how this can be fixed?
Program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
int debug = 0;
int main(int argc, char* argv[])
{
if (argc < 2)
{
puts("ERROR! Please enter a filename\n");
exit(1);
}
else if (argc > 2)
{
debug = atoi(argv[2]);
puts("Debugging mode ENABLED!\n");
}
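FILE *fp = fopen(argv[1], "rb");
if (fp == NULL)
{
perror(argv[1]);
exit(1);
}
fseek(fp, 0, SEEK_END);
long fileSz = ftell(fp);
fseek(fp, 0, SEEK_SET);
char* buffer = (char*) malloc (fileSz + 1); /* +1 for a terminating '\0' */
if (buffer == NULL) exit(1);
size_t readSz = fread(buffer, 1, fileSz, fp);
buffer[readSz] = '\0'; /* terminate so string functions can be used on it */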
rewind(fp);
if (readSz == fileSz)
{
char tmpBuff[100];
fgets(tmpBuff, 100, fp);
if (!ferror(fp))
{
printf("100 characters from text file: %s\n", tmpBuff);
}
else
{
printf("Error encounter");
}
}
if (strstr(buffer, "FRÅGA") == NULL) /* haystack first, then needle */
{
printf("String not found!");
}
return 0;
}

Summary: If you read text from a file encoded in UTF-8 and display it on the console you must either set the console to UTF-8 or transcode the text from UTF-8 to the encoding used by the console (in English-speaking countries, usually MS-DOS code page 437 or 850).
Longer explanation
Bytes are not characters and characters are not bytes. The char data type in C holds a byte, not a character. In particular, the character Å (Unicode <U+00C5>) mentioned in the comments can be represented in many ways, called encodings:
In UTF-8 it is two bytes, '\xC3' '\x85';
In UTF-16 it is two bytes, either '\xC5' '\x00' (little-endian UTF-16), or '\x00' '\xC5' (big-endian UTF-16);
In Latin-1 and Windows-1252, it is one byte, '\xC5';
In MS-DOS code page 437 and code page 850, it is one byte, '\x8F'.
It is the responsibility of the programmer to translate between the internal encoding used by the program (usually but not always Unicode), the encoding used in input or output files, and the encoding expected by the display device.
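As an illustration of translating between two of the encodings listed above, here is a minimal sketch that converts Latin-1 bytes to UTF-8. The function name latin1_to_utf8 is hypothetical and not part of any library. It relies on the fact that Latin-1 byte values are identical to the Unicode code points U+0000 to U+00FF.

```c
#include <assert.h>
#include <stddef.h>

/* Converts n Latin-1 bytes to UTF-8.
   Returns the number of bytes written to out, or 0 if out (outcap bytes)
   is too small. An empty input also yields 0. */
size_t latin1_to_utf8(const unsigned char *in, size_t n,
                      unsigned char *out, size_t outcap)
{
    size_t o = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* ASCII range: one byte, unchanged */
            if (o + 1 > outcap) return 0;
            out[o++] = b;
        } else {
            /* U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx */
            if (o + 2 > outcap) return 0;
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

Feeding it the single Latin-1 byte '\xC5' produces the two UTF-8 bytes '\xC3' '\x85' from the list above. Going the other way, or targeting code page 437 or 850, needs a lookup table or a library such as iconv or ICU.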
Note: Sometimes, if the program does not do much with the characters it reads and outputs, one can get by just by making sure that the input files, the output files, and the display device all use the same encoding. In Linux, this encoding is almost always UTF-8. Unfortunately, on Windows the existence of multiple encodings is a fact of life. System calls expect either UTF-16 or the system's ANSI code page (Windows-1252 on Western European installations). By default, the console displays Code Page 437 or 850. Text files are quite often in UTF-8. Windows is old and complicated.
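For the first option in the summary (setting the console to UTF-8), a minimal sketch could look like this. SetConsoleOutputCP and CP_UTF8 come from the Windows API. The helper names console_use_utf8 and print_utf8_line are hypothetical, and the non-Windows branch simply assumes the terminal already expects UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if the console is expected to interpret output as UTF-8. */
int console_use_utf8(void)
{
#ifdef _WIN32
    /* Switch the console output code page to UTF-8 (code page 65001). */
    return SetConsoleOutputCP(CP_UTF8) ? 1 : 0;
#else
    /* Assumption: on Linux and similar systems the terminal is UTF-8. */
    return 1;
#endif
}

/* Prints a UTF-8 encoded string followed by a newline.
   Returns 1 on success, 0 on failure. */
int print_utf8_line(const char *s)
{
    assert(s != NULL);
    if (!console_use_utf8()) return 0;
    return puts(s) >= 0;
}
```

Whether the characters then render correctly also depends on the console font, which this sketch does not control.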

Related

CS50 Lab4: CS50 IDE (Codespaces) works correctly, while local compilation under Windows 10 fails

CS50 Lab4 code that changes the volume of a .wav file:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
const int HEADER_SIZE = 44;
uint8_t header[HEADER_SIZE];
int16_t buffer;
int main(int argc, char *argv[])
{
// Check command-line arguments
if (argc != 4)
{
printf("Usage: ./volume input.wav output.wav factor\n");
return 1;
}
FILE *input = fopen(argv[1], "r");
if (input == NULL)
{
printf("Could not open file.\n");
return 1;
}
FILE *output = fopen(argv[2], "w");
if (output == NULL)
{
printf("Could not open file.\n");
return 1;
}
float factor = atof(argv[3]);
fread(header, 1, HEADER_SIZE, input);
fwrite(header, 1, HEADER_SIZE, output);
int n = 0; // for debug purpose
while (fread(&buffer, 2, 1, input))
{
buffer = buffer * (float)factor;
fwrite(&buffer, 2, 1, output);
if (n == 2115) // for debug purpose
{
if (n == 2111) // for debug purpose
;
}
printf("%d\n", n++); // for debug purpose
}
// Close files
fclose(input);
fclose(output);
}
The thing is that it works perfectly in the CS50 Codespaces IDE,
but all my local compilers (I've tried bcc32, cpp32, tcc, gcc, and clang) give the same result: a broken output file (it should be 345 KB, but it is only 5 KB):
https://cdn.discordapp.com/attachments/792992622196424725/964833387363852308/output.wav
I've tried some debugging.
According to the debug output, it always stops at step 2117 (buffer value 4608).
Again, I want to point out that in the CS50 IDE it works fine and goes through all 176399 steps :)
I also debugged with feof(input) and ferror().
Please help me solve this puzzle! I can't rest until I understand what's wrong here.

On Windows, you must open binary files with mode rb (or wb for writing). On Unix, you should do that for portability, but in practice it works either way. (And it has nothing to do with the compiler).
The reason is that in text files, Windows treats a byte with value 0x1A (which is Ctrl-Z) as an EOF indicator. Unix doesn't do this; on Unix, the end of the file is where the file ends.
Also, Windows uses a two-character end-of-line indicator (\r\n), which must be translated to a single \n because the C standard requires that multi-character end-of-line indicators in a text file be translated to a single newline character (and translated back when you write to the file). That doesn't happen on Unix either, because Unix line endings are already a single newline character.
So on Windows, if you read a binary file without specifying the b-for-binary open mode, then the read will stop at the first 0x1A in the file. In your case, that seems to have happened on the 2117th character read, but note that that might not be the 2117th character in the file because of newline translation. You could try looking at your file with a binary editor, but the bottom line is that if you think your program might be run under Windows, then you should always use rb and wb for binary files. Unix ignores the b and it tells Windows to stop messing with your file.
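The advice above can be sketched as a small binary-safe copy routine. The function name copy_binary is hypothetical. The only point it illustrates is the "rb" and "wb" modes, which keep 0x1A and \r\n bytes untouched on Windows.

```c
#include <assert.h>
#include <stdio.h>

/* Copies src to dst byte for byte.
   Returns the number of bytes copied, or -1 on error. */
long copy_binary(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
    FILE *in = fopen(src, "rb");   /* b: no Ctrl-Z or newline handling */
    if (in == NULL) return -1;
    FILE *out = fopen(dst, "wb");
    if (out == NULL) { fclose(in); return -1; }

    unsigned char buf[4096];
    size_t n;
    long total = 0;
    int ok = 1;
    /* fread returns the number of items read; 0 means EOF or error. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) { ok = 0; break; }
        total += (long)n;
    }
    if (ferror(in)) ok = 0;
    fclose(in);
    if (fclose(out) != 0) ok = 0;
    return ok ? total : -1;
}
```

In the volume program from the question, the same change would be opening the input with "rb" and the output with "wb"; the rest of the loop can stay as it is.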

File reading problem (using GCC on Linux and Cygwin)

I am using the following program to find out the size of a file and allocate memory dynamically. This program has to be multi-platform functional.
But when I run the program on Linux machine and on a Windows machine using Cygwin, I see different outputs — why?
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
/*
Observation on Linux
When reading text file remember
the content in the text file if arranged in lines like below:
ABCD
EFGH
the size of file is 12, its because each line is ended by \r\n, so add 2 bytes for every line we read.
*/
off_t fsize(char *file) {
struct stat filestat;
if (stat(file, &filestat) == 0) {
return filestat.st_size;
}
return 0;
}
void ReadInfoFromFile(char *path)
{
FILE *fp;
unsigned int size;
char *buffer = NULL;
unsigned int start;
unsigned int buff_size =0;
char ch;
int noc =0;
fp = fopen(path,"r");
start = ftell(fp);
fseek(fp,0,SEEK_END);
size = ftell(fp);
rewind(fp);
printf("file size = %u\n", size);
buffer = (char*) malloc(sizeof(char) * (size + 1) );
if(!buffer) {
printf("malloc failed for buffer \n");
return;
}
buff_size = fread(buffer,sizeof(char),size,fp);
printf(" buff_size = %u\n", buff_size);
if(buff_size == size)
printf("%s \n", buffer);
else
printf("problem in file size \n %s \n", buffer);
fclose(fp);
}
int main(int argc, char *argv[])
{
printf(" using ftell etc..\n");
ReadInfoFromFile(argv[1]);
printf(" using stat\n");
printf("File size = %u\n", fsize(argv[1]));
return 0;
}
The problem is that fread reads a different number of bytes depending on the platform.
I have not tried a native Windows compiler yet.
What would be a portable way to read the contents of a file?
Output on Linux:
using ftell etc..
file size = 34
buff_size = 34
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
using stat
File size = 34
Output on Cygwin:
using ftell etc..
file size = 34
buff_size = 30
problem in file size
ABCDEGFH
IJKLMNOP
QRSTUVWX
YX
_ROAMINGPRã9œw
using stat
File size = 34

Transferring comments into an answer.
The trouble is probably that on Windows, the text file has CRLF line endings ("\r\n"). The input processing maps those to "\n" to match Unix because you use "r" in the open mode (open text file for reading) instead of "rb" (open binary file for reading). This leads to a difference in the byte counts — ftell() reports the bytes including the '\r' characters, but fread() doesn't count them.
But how can I allocate memory, if I don't know the actual size? Even in this case also the return value of fread is 30/34, but my content is only of 26 bytes.
Define your content — there's a newline or CRLF at the end of each of 4 lines. When the file is opened on Windows (Cygwin) in text mode (no b), then you will receive 3 lines of 9 bytes (8 letters and a newline) plus one line with 3 bytes (2 letters and a newline), for 30 bytes in total. Compared to the 34 that's reported by ftell() or stat(), the difference is the 4 CR characters ('\r') that are not returned. If you opened the file as a binary file ("rb"), then you'd get all 34 characters — 3 lines with 10 bytes and 1 line with 4 bytes.
The good news is that the size reported by stat() or ftell() is bigger than the final number of bytes returned, so allocating enough space is not too hard. It might become wasteful if you have a gigabyte size file with every line containing 1 byte of data and a CRLF. Then you'd "waste" (not use) one third of the allocated space. You could always shrink the allocation to the required size with realloc().
Note that there is no difference between text and binary mode on Unix-like (POSIX) systems such as Linux. It does not do mapping of CRLF to NL line endings. If the file is copied from Windows to Linux without mapping the line endings, you will get CRLF at the end of each line on Linux. If the file is copied and the line endings are mapped, you'll get a smaller size on Linux than under Cygwin. (Using "rb" on Linux does no harm; it doesn't do any good either. Using "rb" on Windows/Cygwin could be important; it depends on the behaviour you want.)
See also the C11 standard §7.21.2 Streams and also §7.21.3 Files.
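Putting the points above together, here is a hedged sketch of reading a whole file into memory. The function name read_whole_file is hypothetical. It treats the size from ftell() only as an upper bound, trusts the count returned by fread(), and shrinks the allocation with realloc(). Note that standard C only guarantees that ftell() gives a byte count for binary streams, so for text mode this relies on the behaviour described in the answer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns a NUL-terminated heap copy of the file, or NULL on error.
   *out_len receives the number of bytes actually read. */
char *read_whole_file(const char *path, const char *mode, size_t *out_len)
{
    assert(path != NULL && mode != NULL && out_len != NULL);
    FILE *fp = fopen(path, mode);
    if (fp == NULL) return NULL;

    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0) size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    /* Upper bound plus one byte for the terminator. */
    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(fp); return NULL; }

    /* In text mode on Windows this can be smaller than size. */
    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    buf[got] = '\0';

    /* Give back the unused tail; keep the old block if realloc fails. */
    char *shrunk = realloc(buf, got + 1);
    if (shrunk != NULL) buf = shrunk;

    *out_len = got;
    return buf;
}
```

Called with "rb", the 34-byte file from the question would yield 34 bytes on both platforms; called with "r" under Cygwin it would yield 30, and the terminator lands right after the last byte read, so printing the buffer no longer shows stray characters.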

How to read and process UTF-8 characters in one char in C from a file

How can I read and process UTF-8 characters in one char in C from a file?
This is my code:
FILE *file = fopen(fileName, "rb");
char *code;
size_t n = 0;
if (file == NULL) return NULL;
fseek(file, 0, SEEK_END);
long f_size = ftell(file);
fseek(file, 0, SEEK_SET);
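code = malloc(f_size + 1); // one extra byte for the terminating '\0'
if (code == NULL) { fclose(file); return NULL; }
char a;
// fscanf returns 1 only when a byte was actually read; testing feof()
// before reading would store the last byte twice
while (fscanf(file, "%c", &a) == 1) {
code[n++] = a;
// i want to modify "a" (current char) in here
}
code[n] = '\0';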
This is the file content:
~”م‘‎iاk·¶;R0ثp9´
-پ‘“گAéI‚sہئzOU,HدلKŒ©َض†ُ ت6‘گA=…¢¢³qد4â9àr}hw O‍Uجy.4a³‎M;£´`د$r(q¸Œçً£F 6pG|ںJr(TîsشR

A char commonly holds 256 different values (1 byte), which covers the ASCII table (and an extended 8-bit table if you treat it as unsigned), so a single char cannot hold every UTF-8 encoded character. For handling such characters I would recommend using a wider type such as wchar_t (if wchar_t on your compiler is wide enough for the code points you need; on Windows it is 16 bits), char32_t (available since C11 in <uchar.h> and since C++11), or a library such as ICU.
Edit
This example code shows one way to deal with UTF-8 in C. Note that you have to make sure that wchar_t on your compiler can store the characters you need.
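#include <stdio.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>
#define MAX_CHARS 100000
int main(void) {
    /* ccs=UTF-8 is a Microsoft CRT extension, not standard C */
    FILE *file=fopen("Testing.txt", "r, ccs=UTF-8");
    wchar_t sentence[MAX_CHARS];
    wint_t ch; /* wint_t, not wchar_t, so that WEOF can be told apart */
    int n=0;
    char*loc = setlocale(LC_ALL, "");
    /* stderr, so that stdout stays free for wide output (wprintf) */
    fprintf(stderr, "Locale set to: %s\n", loc ? loc : "(unknown)");
    if(file==NULL){
        fprintf(stderr, "Error processing file\n");
        return 1;
    }
    /* WEOF is the portable end-of-file value for fgetwc */
    while(n < MAX_CHARS-1 && (ch = fgetwc(file)) != WEOF){
        /* wprintf(L"%lc", ch); */
        sentence[n]=ch+1; /*Example modification*/
        n++;
    }
    sentence[n]=L'\0'; /* fputws and wprintf need a terminated string */
    fclose(file);
    file=fopen("Testing.txt", "w, ccs=UTF-8");
    if(file==NULL){
        fprintf(stderr, "Error processing file\n");
        return 1;
    }
    fputws(sentence, file);
    wprintf(L"%ls", sentence);
    fclose(file);
    return 0;
}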
Your system locale
The line char*loc = setlocale(LC_ALL, ""); lets you see your current system locale. Make sure it is a UTF-8 locale if you're using Linux; if you're using Windows, you'll have to stick to one language. This is not a problem if you don't want to print the characters.
How to open the file
Firstly, I opened it as a text file instead of as a binary file. I also had to open the file with UTF-8 formatting (I think on Linux the locale is used, so ccs=UTF-8 won't be necessary). Even though on Windows we're stuck with one language, the file still has to be read as UTF-8.
Using compatible functions with the characters
For this we'll use the functions declared in the wchar.h header (like wprintf and fgetwc). The problem with the other functions is that they are limited to the range of a char, giving the wrong value.
I used as an example this:
¿khñà?
hello
~”م‘‎iاk·¶;R0ثp9´ -پ‘“گAéI‚sہئzOU,HدلKŒ©َض†ُ ت6‘گA=…¢¢³qد4â9àr}hw O‍Uجy.4a³‎M;£´`د$r(q¸Œçً£F 6pG|ںJr(TîsشR
In the last part of the program, it overwrites the file with the accumulated modified string.
You could try changing sentence[n]=ch+1; to sentence[n]=ch; to check in your original file if it reads and outputs the file correctly (and uncomment the wprintf to check the output).
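If you would rather stay with plain bytes read in "rb" mode and not depend on the locale or on ccs=UTF-8, you can decode the UTF-8 sequences yourself. The following is a minimal sketch with a hypothetical function name, utf8_decode. It checks lead and continuation bytes but, as a simplification, does not reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from s, where len bytes are available.
   Returns the number of bytes consumed (1 to 4), or 0 if the
   sequence is invalid or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    unsigned char b0 = s[0];
    size_t need;
    uint32_t value;
    if (b0 < 0x80)                { need = 1; value = b0; }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; value = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; value = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; value = b0 & 0x07; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (len < need) return 0;
    for (size_t i = 1; i < need; i++) {
        /* every following byte must look like 10xxxxxx */
        if ((s[i] & 0xC0) != 0x80) return 0;
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = value;
    return need;
}
```

A loop over the buffer would call utf8_decode at the current offset, process the returned code point, and advance by the returned byte count. Note that the file content shown in the question may not be valid UTF-8 at all, in which case the function returns 0 and the caller has to decide how to skip the offending byte.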

C fopen and fgets returning weird characters instead of file contents

I am doing a coding exercise and I need to open a data file that contains lots of data. It's a .raw file. Before building my app I opened the 'card.raw' file in a text editor and in a hex editor. If you open it in TextEdit you will see 'bit.ly/18gECvy ˇÿˇ‡JFIFHHˇ€Cˇ€Cˇ¿Vˇƒ' as the first line. (The URL points to a Rick Roll as a joke by the professor.)
So I started building my app to open the same 'card.raw' file. I'm doing initial checks to see whether the app prints to the console the same "stuff" I see when I open the file with TextEdit. Instead of printing what I see in TextEdit (see the text above), it starts and continues printing text that looks like this:
\377\304 'u\204\206\226\262\302\3227\205\246\266\342GSc\224\225\245\265\305\306\325\326Wgs\244\346(w\345\362\366\207\264\304ǃ\223\227\2678H\247\250\343\344\365\377\304
Now I have no idea what the '\' and numbers are called (what do I search for to read more?), why it's printing that instead of the characters (unicode?) I see when I open in TextEdit, or if I can convert this output to hex or unicode.
My code is:
#include <stdio.h>
#include <string.h>
#include <limits.h>
int main(int argc, const char * argv[]) {
FILE* file;
file = fopen("/Users/jamesgoldstein/CS50/CS50Week4/CS50Recovery/CS50Recovery/CS50Recovery/card.raw", "r");
char output[LINE_MAX];
if (file != NULL)
{
for (int i = 1; fgets(output, LINE_MAX, file) != NULL; i++)
{
printf("%s\n", output);
}
}
fclose(file);
return 0;
}
UPDATED & SIMPLIFIED CODE USING fread()
#include <stdio.h>
#include <string.h>
int main(int argc, const char * argv[]) {
FILE* fp = fopen("/Users/jamesgoldstein/CS50/CS50Week4/CS50Recovery/CS50Recovery/CS50Recovery/card.raw", "rb");
char output[256];
if (fp == NULL)
{
printf("Bad input\n");
return 1;
}
for (int i = 1; fread(output, sizeof(output), 1, fp) != NULL; i++)
{
printf("%s\n", output);
}
fclose(fp);
return 0;
}
Output is partially correct (here's a snippet of the beginning):
bit.ly/18gECvy
\377\330\377\340
\221\241\26145\301\321\341 "#&23DE\3616BFRTUe\202CVbdfrtv\222\242
'u\204\206\226\262\302\3227\205\246\266\342GSc\224\225\245\265\305\306\325\326Wgs\244\346(w\345\362\366\207\264\304ǃ\223\227\2678H\247\250\343\344\365\377\304
=\311\345\264\352\354 7\222\315\306\324+\342\364\273\274\205$z\262\313g-\343wl\306\375My:}\242o\210\377
3(\266l\356\307T饢"2\377
\267\212ǑP\2218 \344
Actual card.raw file snippet of beginning
bit.ly/18gECvy ˇÿˇ‡JFIFHHˇ€Cˇ€Cˇ¿Vˇƒ
ˇƒÖ
!1AQa$%qÅë°±45¡—· "#&23DEÒ6BFRTUeÇCVbdfrtví¢

I think you should open the .raw file in the mode "rb".
Then use fread()

From the presence of the string "JFIF" in the first line of the file card.raw ("bit.ly/18gECvy ˇÿˇ‡JFIFHHˇ€Cˇ€Cˇ¿Vˇƒ") it seems like card.raw is a JPEG image format file that had the bit.ly URL inserted at its beginning.
You are going to see weird/special characters in this case because it is not a usual text file at all.
Also, as davmac pointed out, the way you are using fgets isn't appropriate even if you were dealing with an actual text file. When dealing with plain text files in C, a simple approach is to read the entire file at once instead of line by line, assuming sufficient memory is available:
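long f_len;
size_t f_actualread;
char *buffer = NULL;
/* file is the FILE * opened earlier with fopen() */
if (fseek(file, 0, SEEK_END) != 0 || (f_len = ftell(file)) < 0)
{
    puts("could not determine file size");
    return 1;
}
rewind(file);
buffer = malloc((size_t)f_len + 1);
if(buffer == NULL)
{
    puts("malloc failed");
    return 1;
}
f_actualread = fread(buffer, 1, (size_t)f_len, file);
buffer[f_actualread] = 0; /* terminate so it can be printed as a string */
printf("%s\n", buffer);
free(buffer);
buffer = NULL;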
This way, you don't need to worry about line lengths or anything like that.

You should probably use fread rather than fgets, since the latter is really designed for reading text files, and this is clearly not a text file.
Your updated code in fact does have the very problem I originally wrote about (but have since retracted), since you are now using fread rather than fgets:
for (int i = 1; fread(output, sizeof(output), 1, fp) != NULL; i++)
{
printf("%s\n", output);
}
I.e. you are printing the output buffer as if it were a null-terminated string, when in fact it is not. Better to use fwrite to stdout.
However, I think the essence of the problem here is trying to display arbitrary bytes (which don't actually represent a character string) to the terminal. The terminal may interpret some byte sequences as commands which affect what you see. Also, TextEdit may determine that the file is in some character encoding and decode characters accordingly.
Now I have no idea what the '\' and numbers are called (what do I search for to read more?)
They look like octal escape sequences to me.
why it's printing that instead of the characters (unicode?)
It's nothing to do with unicode. Maybe it's your terminal emulator deciding that those characters are unprintable, and so replacing them with an escape sequence.
In short, I think that your method (comparing visually what you see in a text editor with what you see on the terminal) is flawed. The code you have to read from the file looks correct; I'd suggest proceeding with the exercise and checking results then, or if you really want to be sure, look at the file using a hex editor, and have your program output the byte values it reads (as numbers) - and compare those with what you see in the hex editor.
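The last suggestion (printing the byte values as numbers and comparing them with a hex editor) could be sketched like this. The function name hex_dump and the layout of 16 bytes per line are arbitrary choices for the illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Writes every byte of in to out as two hex digits, 16 bytes per line.
   Returns the number of bytes dumped. */
size_t hex_dump(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    unsigned char buf[256];
    size_t n, total = 0;
    /* Use the count returned by fread, never treat buf as a string. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++, total++) {
            fprintf(out, "%02x%c", buf[i], (total % 16 == 15) ? '\n' : ' ');
        }
    }
    if (total % 16 != 0) fputc('\n', out);
    return total;
}
```

Called as hex_dump(fp, stdout) on a stream opened with "rb", a JPEG header should show up as ff d8 ff e0, which matches the \377\330\377\340 octal escapes in the output above.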

C read binary stdin

I'm trying to build an instruction pipeline simulator and I'm having a lot of trouble getting started. What I need to do is read binary from stdin, and then store it in memory somehow while I manipulate the data. I need to read in chunks of exactly 32 bits one after the other.
How do I read in chunks of exactly 32 bits at a time? Secondly, how do I store it for manipulation later?
Here's what I've got so far, but examining the binary chunks I read further, it just doesn't look right, I don't think I'm reading exactly 32 bits like I need.
char buffer[4] = { 0 }; // initialize to 0
unsigned long c = 0;
int bytesize = 4; // read in 32 bits
while (fgets(buffer, bytesize, stdin)) {
memcpy(&c, buffer, bytesize); // copy the data to a more usable structure for bit manipulation later
// more stuff
buffer[0] = 0; buffer[1] = 0; buffer[2] = 0; buffer[3] = 0; // set to zero before next loop
}
fclose(stdin);
How do I read in 32 bits at a time (they are all 1/0, no newlines etc), and what do I store it in, is char[] okay?
EDIT: I'm able to read the binary in but none of the answers produce the bits in the correct order — they are all mangled up, I suspect endianness and problems reading and moving 8 bits around ( 1 char) at a time — this needs to work on Windows and C ... ?

What you need is freopen(). From the manpage:
If filename is a null pointer, the freopen() function shall attempt to change the mode of the stream to that specified by mode, as if the name of the file currently associated with the stream had been used. In this case, the file descriptor associated with the stream need not be closed if the call to freopen() succeeds. It is implementation-defined which changes of mode are permitted (if any), and under what circumstances.
Basically, the best you can really do is this:
freopen(NULL, "rb", stdin);
This will reopen stdin to be the same input stream, but in binary mode. In the normal mode, reading from stdin on Windows will convert \r\n (Windows newline) to the single character ASCII 10. Using the "rb" mode disables this conversion so that you can properly read in binary data.
freopen() returns the stream on success and a null pointer on failure, so check the return value and then keep reading through stdin. After that, use fread() as has been mentioned.
As to your concerns, however, you may not be reading in "32 bits" but if you use fread() you will be reading in 4 chars (which is the best you can do in C - char is guaranteed to be at least 8 bits but some historical and embedded platforms have 16 bit chars (some even have 18 or worse)). If you use fgets() with a 4-byte buffer you will never read in 4 bytes. You will read in at most 3 (fewer if one of them is a newline), and the byte after them will be '\0' because C strings are nul-terminated and fgets() nul-terminates what it reads (like a good function). Obviously, this is not what you want, so you should use fread().
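A minimal sketch of reading 4-byte chunks with fread() follows. The helper name read_chunk32 is hypothetical. Unlike fgets(), it does not stop at newlines and does not spend a byte on a terminator.

```c
#include <assert.h>
#include <stdio.h>

/* Reads exactly 4 bytes from in into chunk.
   Returns 1 for a full chunk, 0 at end of file, on error,
   or when fewer than 4 bytes were left. */
int read_chunk32(FILE *in, unsigned char chunk[4])
{
    assert(in != NULL && chunk != NULL);
    return fread(chunk, 1, 4, in) == 4;
}
```

A typical loop would be while (read_chunk32(stdin, chunk)) { ... }, run after stdin has been switched to binary mode on Windows. Using unsigned char avoids sign surprises when the bytes are later shifted into a wider integer.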

Consider using SET_BINARY_MODE macro and setmode:
#ifdef _WIN32
# include <io.h>
# include <fcntl.h>
# define SET_BINARY_MODE(handle) setmode(handle, O_BINARY)
#else
# define SET_BINARY_MODE(handle) ((void)0)
#endif
More details about SET_BINARY_MODE macro here: "Handling binary files via standard I/O"
More details about setmode here: "_setmode"

I had to piece the answer together from the various comments from the kind people above, so here is a complete sample. It works only on Windows, but you can probably translate the Windows-specific parts to your platform.
#include "stdafx.h"
#include "stdio.h"
#include "stdlib.h"
#include "windows.h"
#include <io.h>
#include <fcntl.h>
int main()
{
char rbuf[4096];
char *deffile = "c:\\temp\\outvideo.bin";
size_t r;
char *outfilename = deffile;
FILE *newin;
freopen(NULL, "rb", stdin);
_setmode(_fileno(stdin), _O_BINARY);
FILE *f = fopen(outfilename, "w+b");
if (f == NULL)
{
printf("unable to open %s\n", outfilename);
exit(1);
}
for (;; )
{
r = fread(rbuf, 1, sizeof(rbuf), stdin);
if (r > 0)
{
size_t w;
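for (size_t nleft = r; nleft > 0; )
{
// continue after the bytes already written, not from the start of rbuf
w = fwrite(rbuf + (r - nleft), 1, nleft, f);
if (w == 0)
{
printf("error: unable to write %lu bytes to %s\n", (unsigned long)nleft, outfilename);
exit(1);
}
nleft -= w;
fflush(f);
}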
}
else
{
Sleep(10); // wait for more input, but not in a tight loop
}
}
return 0;
}

For Windows, this Microsoft _setmode example specifically shows how to change stdin to binary mode:
// crt_setmode.c
// This program uses _setmode to change
// stdin from text mode to binary mode.
#include <stdio.h>
#include <fcntl.h>
#include <io.h>
int main( void )
{
int result;
// Set "stdin" to have binary mode:
result = _setmode( _fileno( stdin ), _O_BINARY );
if( result == -1 )
perror( "Cannot set mode" );
else
printf( "'stdin' successfully changed to binary mode\n" );
}

fgets() is all wrong here. It's aimed at human-readable ASCII text terminated by end-of-line characters, not binary data, and won't get you what you need.
I recently did exactly what you want using the read() call. Unless your program has explicitly closed stdin, for the first argument (the file descriptor), you can use a constant value of 0 for stdin. Or, if you're on a POSIX system (Linux, Mac OS X, or some other modern variant of Unix), you can use STDIN_FILENO.
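A sketch of the read() approach for POSIX systems follows. The helper name read_full is hypothetical. Because read() may return fewer bytes than requested, especially on pipes and terminals, the loop keeps reading until the requested count is reached or the input ends.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Reads up to n bytes from fd into buf, retrying on short reads.
   Returns the number of bytes read (less than n only at end of input),
   or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t n)
{
    assert(buf != NULL);
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r == 0) break;               /* end of input */
        if (r < 0) {
            if (errno == EINTR) continue; /* interrupted, try again */
            return -1;
        }
        got += (size_t)r;
    }
    return (ssize_t)got;
}
```

For the 32-bit chunks in the question, the call would be read_full(STDIN_FILENO, word, 4), treating any result other than 4 as the end of usable input.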

fread() suits best for reading binary data.
Yes, char array is OK, if you are planning to process them bytewise.

I don't know what OS you are running, but you typically cannot "open stdin in binary". You can try things like
int fd = fdreopen (fileno (stdin), outfname, O_RDONLY | OPEN_O_BINARY);
to try to force it. Then use
uint32_t opcode;
read(fd, &opcode, sizeof (opcode));
But I have not actually tried it myself. :)

I had it right the first time, except that I needed ntohl (see "C Endian Conversion : bit by bit").
