Hi programmers,
I want to read a Unicode (UTF-8) text file created by Notepad line by line. I don't want to display the Unicode strings on the screen; I just want to read them and compare them.
This code reads an ANSI file line by line and compares the strings.
What I want
Read test_ansi.txt line by line
if the line = "b" print "YES!"
else print "NO!"
read_ansi_line_by_line.c
#include <stdio.h>
#include <string.h> /* for strcmp() */

int main(void)
{
    char *inname = "test_ansi.txt";
    FILE *infile;
    char line_buffer[BUFSIZ]; /* BUFSIZ is defined in <stdio.h> */
    int line_number;

    infile = fopen(inname, "r");
    if (!infile) {
        printf("\nfile '%s' not found\n", inname);
        return 0;
    }
    printf("\n%s\n\n", inname);

    line_number = 0;
    while (fgets(line_buffer, sizeof(line_buffer), infile)) {
        ++line_number;
        /* note that the newline is still in the buffer */
        if (strcmp("b\n", line_buffer) == 0) {
            printf("%d: YES!\n", line_number);
        } else {
            printf("%d: NO!\n", line_number);
        }
    }
    printf("\n\nTotal: %d\n", line_number);
    fclose(infile);
    return 0;
}
test_ansi.txt
a
b
c
Compiling
gcc -o read_ansi_line_by_line read_ansi_line_by_line.c
Output
test_ansi.txt
1: NO!
2: YES!
3: NO!
Total: 3
Now I need to read a Unicode (UTF-8) file created by Notepad. After more than 6 months I have not found any good code/library in C that can read a file encoded in UTF-8. I don't know exactly why, but I think standard C doesn't really support Unicode.
Reading a Unicode binary file works, but the problem is that the binary file must already have been created in binary mode. That means if we want to read a Unicode (UTF-8) file created by Notepad, we first need to translate it from a UTF-8 file to a binary file.
This code writes a Unicode string to a binary file. NOTE: the C source file is encoded in UTF-8 and compiled with GCC.
What I want
Write the Unicode character "ب" to test_bin.dat
create_bin.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Data to be stored in the file */
    wchar_t line_buffer[BUFSIZ] = L"ب";

    /* Open the file for writing in binary mode */
    FILE *outfile = fopen("test_bin.dat", "wb");
    if (!outfile)
        return 1;

    /* Write the string including its terminating L'\0' */
    fwrite(line_buffer, sizeof(wchar_t), wcslen(line_buffer) + 1, outfile);

    /* Close the file */
    fclose(outfile);
    return 0;
}
Compiling
gcc -o create_bin create_bin.c
Output
creates test_bin.dat
Now I want to read the binary file line by line and compare.
What I want
Read test_bin.dat line by line
if the line = "ب" print "YES!"
else print "NO!"
read_bin_line_by_line.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t *inname = L"test_bin.dat";
    FILE *infile;
    wchar_t line_buffer[BUFSIZ]; /* BUFSIZ is defined in <stdio.h> */

    infile = _wfopen(inname, L"rb"); /* _wfopen is Windows-specific */
    if (!infile) {
        wprintf(L"\nfile '%ls' not found\n", inname);
        return 0;
    }
    wprintf(L"\n%ls\n\n", inname);

    /* Read whole wchar_t elements into the temporary buffer;
       the terminating L'\0' was written by create_bin.c */
    while (fread(line_buffer, sizeof(wchar_t), BUFSIZ, infile)) {
        if (wcscmp(L"ب", line_buffer) == 0) {
            wprintf(L"YES!\n");
        } else {
            wprintf(L"NO!\n");
        }
    }

    /* Close the file */
    fclose(infile);
    return 0;
}
Output
test_bin.dat
YES!
THE PROBLEM
This method is VERY LONG and NOT ROBUST (I am a beginner in software engineering).
Does anyone know how to read a Unicode file? (I know it's not easy!)
Does anyone know how to convert a Unicode file to a binary file? (a simple method)
Does anyone know how to read a Unicode file in binary mode? (I'm not sure this is possible)
Thank You.
A nice property of UTF-8 is that you do not need to decode it in order to compare it. The ordering strcmp returns will be the same whether you decode the text first or not. So just read the lines as raw bytes and run strcmp.
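A minimal sketch of that idea (the comparison literal is written out as the raw UTF-8 bytes of "ب", and the file name is just an example; note that if Notepad put a UTF-8 BOM at the start of the file, the first line will carry those three extra bytes, see the BOM discussion further down):
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line_buffer[BUFSIZ];
    FILE *infile = fopen("test_utf8.txt", "r");
    if (!infile)
        return 1;

    /* fgets does not care about the encoding: it hands back the raw
       bytes of each line, which is exactly what strcmp compares. */
    while (fgets(line_buffer, sizeof(line_buffer), infile)) {
        if (strcmp("\xD8\xA8\n", line_buffer) == 0)  /* "ب" in UTF-8 */
            printf("YES!\n");
        else
            printf("NO!\n");
    }
    fclose(infile);
    return 0;
}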
I found a solution to my problem, and I would like to share it with anyone interested in reading a UTF-8 file in C99.
#include <stdio.h>
#include <string.h>

void ReadUTF8(FILE* fp)
{
    unsigned char iobuf[255] = {0};
    while( fgets((char*)iobuf, sizeof(iobuf), fp) )
    {
        size_t len = strlen((char *)iobuf);
        // strip the trailing newline, if any
        if(len > 1 && iobuf[len-1] == '\n')
            iobuf[len-1] = 0;
        len = strlen((char *)iobuf);
        printf("(%zu) \"%s\" ", len, iobuf);
        if( iobuf[0] == '\n' )   // blank line
            printf("Yes\n");
        else
            printf("No\n");
    }
}

void ReadUTF16BE(FILE* fp)
{
    // not implemented yet
}

void ReadUTF16LE(FILE* fp)
{
    // not implemented yet
}

int main(void)
{
    FILE* fp = fopen("test_utf8.txt", "r");
    if( fp != NULL)
    {
        // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation
        // of the BOM encoding
        unsigned char b[3] = {0};
        fread(b,1,2, fp);
        if( b[0] == 0xEF && b[1] == 0xBB)
        {
            fread(b,1,1,fp); // consume the remaining 0xBF
            ReadUTF8(fp);
        }
        else if( b[0] == 0xFE && b[1] == 0xFF)
        {
            ReadUTF16BE(fp);
        }
        else if( b[0] == 0xFF && b[1] == 0xFE)  // the UTF-16LE BOM is FF FE
        {
            ReadUTF16LE(fp);
        }
        else
        {
            // we don't know what kind of file it is, so assume it is plain
            // ASCII/UTF-8 with no BOM
            rewind(fp);
            ReadUTF8(fp);
        }
        fclose(fp);
    }
    return 0;
}
fgets() can decode UTF-8 encoded files if you use Visual Studio 2005 and up. Change your code like this:
infile = fopen(inname, "r, ccs=UTF-8");
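Note that ccs=UTF-8 is an MSVC-specific extension, not standard C. With it the stream becomes Unicode (the runtime converts the UTF-8 bytes to UTF-16 as it reads), so it is usually read with the wide-character functions. A sketch of how that might look (Windows only; the file name is just an example):
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *infile = fopen("test_utf8.txt", "r, ccs=UTF-8");
    wchar_t line_buffer[BUFSIZ];

    if (!infile)
        return 1;

    /* Each line arrives as wchar_t text, so compare with the wide functions */
    while (fgetws(line_buffer, BUFSIZ, infile)) {
        if (wcscmp(L"b\n", line_buffer) == 0)
            wprintf(L"YES!\n");
        else
            wprintf(L"NO!\n");
    }
    fclose(infile);
    return 0;
}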
I know I am bad... but you don't even take the BOM into consideration! Most examples here will fail.
EDIT:
Byte Order Marks are a few bytes at the beginning of the file which can be used to identify the encoding of the file. Some editors add them, and many times they just break things in fabulous ways (I remember fighting a PHP headers problem for several minutes because of this issue).
Some RTFM:
http://en.wikipedia.org/wiki/Byte_order_mark
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
What is XML BOM and how do I detect it?
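For the UTF-8 case specifically, skipping the BOM before reading lines only takes a few lines of code (a sketch, assuming you open the file yourself and pass the stream around):
#include <stdio.h>

/* If the first three bytes are the UTF-8 BOM (EF BB BF), consume them;
   otherwise rewind so normal reading starts at the first real byte. */
static void skip_utf8_bom(FILE *fp)
{
    unsigned char bom[3];
    if (fread(bom, 1, 3, fp) != 3 ||
        bom[0] != 0xEF || bom[1] != 0xBB || bom[2] != 0xBF)
        rewind(fp);
}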
This article provides encoding and decoding routines and explains how Unicode is encoded:
http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/
It can easily be adapted to C.
Simply encode your ANSI string or decode the UTF-8 string and do a byte compare.
EDIT: After the OP said that it is too hard to rewrite the function from C++, here is a template.
What is still needed:
+ Free the allocated memory (or wait till the process ends, or ignore it)
+ Add the 4-byte functions
+ Tell me that short and int are not guaranteed to be 2 and 4 bytes long (I know, but
C is really stupid!) and finally
+ Find some other errors
#include <stdlib.h>
#include <string.h>

#define MASKBITS   0x3F
#define MASKBYTE   0x80
#define MASK2BYTES 0xC0
#define MASK3BYTES 0xE0
#define MASK4BYTES 0xF0
#define MASK5BYTES 0xF8
#define MASK6BYTES 0xFC

char* UTF8Encode2BytesUnicode(unsigned short* input)
{
    int size = 0,
        cindex = 0;
    while (input[size] != 0)
        size++;

    // Reserve enough space: at most 3 UTF-8 bytes per 16-bit code unit,
    // plus the terminating null byte
    char* result = (char*) malloc(size * 3 + 1);

    for (int i = 0; i < size; i++)
    {
        // 0xxxxxxx
        if(input[i] < 0x80)
        {
            result[cindex++] = ((char) input[i]);
        }
        // 110xxxxx 10xxxxxx
        else if(input[i] < 0x800)
        {
            result[cindex++] = ((char)(MASK2BYTES | input[i] >> 6));
            result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
        }
        // 1110xxxx 10xxxxxx 10xxxxxx
        else if(input[i] < 0x10000)
        {
            result[cindex++] = ((char)(MASK3BYTES | input[i] >> 12));
            result[cindex++] = ((char)(MASKBYTE | (input[i] >> 6 & MASKBITS)));
            result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
        }
    }
    result[cindex] = '\0';
    return result;
}

wchar_t* UTF8Decode2BytesUnicode(char* input)
{
    int size = strlen(input);
    wchar_t* result = (wchar_t*) malloc((size + 1) * sizeof(wchar_t));
    int rindex = 0,
        windex = 0;

    while (rindex < size)
    {
        wchar_t ch;
        // 1110xxxx 10xxxxxx 10xxxxxx
        if((input[rindex] & MASK3BYTES) == MASK3BYTES)
        {
            ch = ((input[rindex] & 0x0F) << 12)
               | ((input[rindex+1] & MASKBITS) << 6)
               | (input[rindex+2] & MASKBITS);
            rindex += 3;
        }
        // 110xxxxx 10xxxxxx
        else if((input[rindex] & MASK2BYTES) == MASK2BYTES)
        {
            ch = ((input[rindex] & 0x1F) << 6) | (input[rindex+1] & MASKBITS);
            rindex += 2;
        }
        // 0xxxxxxx
        else
        {
            ch = input[rindex];
            rindex += 1;
        }
        result[windex++] = ch;
    }
    result[windex] = 0;
    return result;
}

// TODO: one of the missing 4-byte functions from the list above
char* UTF8Encode4BytesUnicode(unsigned int* input);

char* getUnicodeToUTF8(wchar_t* myString) {
    int size = sizeof(wchar_t);
    if (size == 1)
        return (char*) myString;
    else if (size == 2)
        return UTF8Encode2BytesUnicode((unsigned short*) myString);
    else
        return UTF8Encode4BytesUnicode((unsigned int*) myString);
}
Just to settle the BOM argument, here is a file from Notepad:
[paul#paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
0000000 ef bb bf 61 0d 0a 62 0d 0a 63
0000012
Note the BOM (ef bb bf) at the start.
Personally I don't think there should be a BOM (since it's a byte-oriented format), but that's not the point.
Related
My assignment (not homework, just a "try if you can do this" thing) is to use bit operations to encrypt and decrypt a .txt file.
This is the program. It successfully opens files for reading and writing, but it puts all 0's and spaces into the output.txt file instead of the expected "encrypted" text. I am guessing the issue comes from a fundamental misunderstanding of either data types or putc(). I understand that it outputs an unsigned char, but my professor said that an unsigned char is nothing but an unsigned int; I'm not sure if this is strictly true or if it was a pedagogical simplification. Thanks a lot for any help.
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#define NUMARG 3
#define INFILEARG 1
#define OUTFILEARG 2
int main(int argc, char *argv[]){
/* Function prototypes */
unsigned int encryptDecrypt(unsigned int x, unsigned int ed);
const char *get_filename_ext(const char *filename);
FILE *finp;
FILE *foutp;
//ed for encryption/decryption choice
unsigned int ed, c;
const char *ext;
//Check for errors in argument number and file opening.
if(argc != NUMARG){
printf("You have to put the input and output files after the
program name.\n");
return(1);
}
if( (finp = fopen(argv[INFILEARG], "r")) == NULL ){
printf("Couldn't open %s for reading.\n", argv[INFILEARG]);
return(1);
}
if( (foutp = fopen(argv[OUTFILEARG], "w")) == NULL){
printf("Couldn't open %s for writing.\n", argv[OUTFILEARG]);
return(1);
}
//Get and check file extension.
ext = get_filename_ext(argv[INFILEARG]);
if(strcmp(ext, "txt")){
printf("Input file is not a .txt file.\n");
return(1);
}
ext = get_filename_ext(argv[OUTFILEARG]);
if(strcmp(ext, "txt")){
printf("Output file is not a .txt file.\n");
return(1);
}
//Get command to encrypt or decrypt.
do{
printf("Enter e to encrypt, d to decrypt: ");
ed = getchar();
} while(ed != 'e' && ed != 'd');
//Send characters to output file.
while((c = getc(finp)) != EOF ){
putc(encryptDecrypt(c, ed), foutp);
}
// Close files.
if (fclose(finp) == EOF){
printf("Error closing input file.\n");
}
if (fclose(foutp) == EOF){
printf("Error closing output file.\n");
}
if ( ed == 'e'){
printf("Encrypted data written.\n");
} else {
printf("Data decrypted.\n");
}
return 0;
}
const char *get_filename_ext(const char *filename) {
const char *dot = strrchr(filename, '.');
if(!dot || dot == filename) return "";
return dot + 1;
}
unsigned int encryptDecrypt(unsigned int c, unsigned int ed){
if( ed == 'e' ){
printf("%d before operated on.\n", c);
c &= 134;
printf("%d after &134.\n", c);
c ^= 6;
printf("%d after ^6. \n", c);
c <<= 3;
printf("%d after <<3\n", c);
}
else {
c >>= 3;
c ^= 6;
c &= 134;
}
return c;
}
Output:
ZacBook:bitoperations $ cat input1.txt
Please encrypt this message.
ZacBook:bitoperations $ ./encrypt.o input1.txt output.txt
Enter e to encrypt, d to decrypt: e
80 before operated on.
0 after &134.
6 after ^6.
48 after <<3
108 before operated on.
4 after &134.
2 after ^6.
16 after <<3
[...Many more of these debug lines]
2 after &134.
4 after ^6.
32 after <<3
Encrypted data written.
ZacBook:bitoperations $ cat output.txt
00 0 00000 0 0
As you can see, the unsigned int is being operated on successfully. I believe the problem is with putc(), but I have tried changing the type of c to char and to int and neither has worked.
Your main problem is that &= is a lossy transformation: that is, you lose data.
Ditto <<= and >>=, as both cause bits shifted off the ends to be lost.
You'll have more luck sticking to XOR, at first at least. That's because x ^ y ^ y is x.
You can eliminate putc etc. by isolating the encryption / decryption processes from the data acquisition stages, and by hardcoding the inputs while you're getting things working.
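To illustrate the XOR point, here is a minimal sketch (not the assignment's required scheme; the key byte 0x5A is arbitrary):
#include <stdio.h>

/* XOR with a fixed key byte: applying the same function twice restores
   the original value, because (c ^ k) ^ k == c. */
static unsigned int encryptDecrypt(unsigned int c)
{
    return c ^ 0x5A; /* arbitrary key byte */
}

int main(void)
{
    int c;
    while ((c = getchar()) != EOF)
        putchar(encryptDecrypt((unsigned int)c));
    return 0;
}
Running a file through it twice gives back the original text, which is exactly the property the lossy &= and <<= versions destroy.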
I'm trying to read from a text file in C that contains a list of 8 bit binary numbers to be used in another function.
The text file is formatted like:
01101101
10110110
10101101
01001111
11010010
00010111
00101011
Etc...
Here's roughly what I was trying to do.
Pseudo code
void bincalc(char eight_bit_num){
//does stuff
}
int main()
{
FILE* f = fopen("test.txt", "r");
int n = 0, i = 0;
while( fscanf(f, "%d ", &n) > 0 ) // parse %d followed by a new line or space
{
bincalc(n);
}
fclose(f);
}
I think I'm on the right track; any help is appreciated.
This is not the standard way to do it, but it is OK. You are scanning the file for ints, so it will read each line and interpret it as a decimal number, which you then have to convert to the value of the corresponding binary digits, i.e. converting decimal 1010 to 2^3+2^1 = 10. This is of course possible; you just need to transform powers of ten into powers of two (1*10^3 + 0*10^2 + 1*10^1 + 0*10^0 becomes 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0). Be careful that this works with 8-bit numbers, but not with much bigger ones (16 bits will not fit).
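A small sketch of that digit conversion (assuming each line really is at most 8 binary digits, so the value read with %d fits in an int):
#include <stdio.h>

/* Reinterpret an int whose decimal digits are all 0s and 1s
   (e.g. 1010) as the binary number they spell (here, 10). */
static unsigned int decimal_digits_to_binary(int n)
{
    unsigned int value = 0, bit = 1;
    while (n > 0) {
        if (n % 10)      /* current decimal digit is 1 */
            value |= bit;
        bit <<= 1;       /* next binary position */
        n /= 10;
    }
    return value;
}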
The more common way is to read the lines as strings and decode them directly, or to read char by char and do an incremental conversion.
If you want to have a number at the end, I would suggest something like the following:
int i;
char buf[10], val;
FILE* f = fopen("test.txt", "r");

while( fscanf(f, "%9s", buf) == 1 ) {  /* check fscanf's return value rather than feof() */
    val = 0;
    for( i=0; i<8; i++ )
        val = (val << 1) | (buf[i]-'0');
    /* val is an 8 bit number now representing the binary digits */
    bincalc(val);
}
fclose(f);
This is just a short snippet to illustrate the idea and does not catch all corner cases regarding the file or buffer handling. I hope it will help anyway.
I think this is a possible solution. I made it simple so you can understand it and adapt it to your needs. You only need to repeat this for each line.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *file = fopen("data","r");
    char *result = calloc(1, sizeof(char));
    int i = 0;

    for(; i < 8; i++)
    {
        char get = (char)getc(file);
        if(get == '0')
            *result <<= 1;
        else if(get == '1')
            *result = ((*result << 1) | 0x1);
    }
    printf("->%c\n", *result);

    free(result);
    fclose(file);
    return 0;
}
I don't know of a way to specify binary format for fscanf() but you can convert a binary string like this.
#include <stdio.h>
#include <stdlib.h>
void fatal(char *msg) {
printf("%s\n", msg);
exit (1);
}
unsigned binstr2int(char *str) {
int i = 0;
while (*str != '\0' && *str != '\n') { // string term or newline?
if (*str != '0' && *str != '1') // binary digit?
fatal ("Not a binary digit\n");
i = i * 2 + (*str++ & 1);
}
return i;
}
int main(void) {
unsigned x;
char inp[100];
FILE *fp;
if ((fp = fopen("test.txt", "r")) == NULL)
fatal("Unable to open file");
while (fgets(inp, 99, fp) != NULL) {
x = binstr2int(inp);
printf ("%X\n", x);
}
fclose (fp);
return 0;
}
File input
01101101
10110110
10101101
01001111
11010010
00010111
00101011
Program output
6D
B6
AD
4F
D2
17
2B
Read the file line by line, then use strtol() with base 2.
Or, if the file is guaranteed not to be long, you could also read the full contents into (dynamically allocated) memory and use the endptr parameter.
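A minimal sketch of the line-by-line variant (the file name follows the question's test.txt):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char line[64];
    FILE *f = fopen("test.txt", "r");
    if (!f)
        return 1;

    while (fgets(line, sizeof line, f)) {
        char *end;
        /* base 2: parses the 0/1 digits and stops at the newline */
        long value = strtol(line, &end, 2);
        if (end != line)
            printf("%02lX\n", value);
    }
    fclose(f);
    return 0;
}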
I am a biology student and I am trying to learn perl, python and C and also use the scripts in my work. So, I have a file as follows:
>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG
The output should look like this, that is, the name of each sequence and the number of characters in its line, with the total number of sequences printed at the end of the file:
sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3
I could make the perl and python scripts work, this is the python script as an example:
#!/usr/bin/python
import sys
my_file = open(sys.argv[1]) #open the file
my_output = open(sys.argv[2], "w") #open output file
total_sequence_counts = 0
for line in my_file:
    if line.startswith(">"):
        sequence_name = line.rstrip('\n').replace(">","")
        total_sequence_counts += 1
        continue
    dna_length = len(line.rstrip('\n'))
    my_output.write(sequence_name + " " + str(dna_length) + '\n')
my_output.write("Total number of sequences = " + str(total_sequence_counts) + '\n')
Now, I want to write the same script in C, this is what I have achieved so far:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])
{
input = FILE *fopen(const char *filename, "r");
output = FILE *fopen(const char *filename, "w");
double total_sequence_counts = 0;
char sequence_name[];
char line [4095]; // set a temporary line length
char buffer = (char *) malloc (sizeof(line) +1); // allocate some memory
while (fgets(line, sizeof(line), filename) != NULL) { // read until new line character is not found in line
buffer = realloc(*buffer, strlen(line) + strlen(buffer) + 1); // realloc buffer to adjust buffer size
if (buffer == NULL) { // print error message if memory allocation fails
printf("\n Memory error");
return 0;
}
if (line[0] == ">") {
sequence_name = strcpy(sequence_name, &line[1]);
total_sequence_counts += 1
}
else {
double length = strlen(line);
fprintf(output, "%s \t %ld", sequence_name, length);
}
fprintf(output, "%s \t %ld", "Total number of sequences = ", total_sequence_counts);
}
int fclose(FILE *input); // when you are done working with a file, you should close it using this function.
return 0;
int fclose(FILE *output);
return 0;
}
But this code is of course full of mistakes. My problem is that despite studying a lot, I still can't properly understand and use memory allocation and pointers, so I know I especially have mistakes in that part. It would be great if you could comment on my code and show how it can be turned into a script that actually works. By the way, in my actual data the length of each line is not fixed, so I need to use malloc and realloc for that purpose.
For a simple program like this, where you look at short lines one at a time, you shouldn't worry about dynamic memory allocation. It is probably good enough to use local buffers of a reasonable size.
Another thing is that C isn't particularly suited for quick-and-dirty string processing. For example, there isn't a strstrip function in the standard library. You usually end up implementing such behaviour yourself.
An example implementation looks like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#define MAXLEN 80 /* Maximum line length, including null terminator */
int main(int argc, char *argv[])
{
FILE *in;
FILE *out;
char line[MAXLEN]; /* Current line buffer */
char ref[MAXLEN] = ""; /* Sequence reference buffer */
int nseq = 0; /* Sequence counter */
if (argc != 3) {
fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
exit(1);
}
in = fopen(argv[1], "r");
if (in == NULL) {
fprintf(stderr, "Couldn't open %s.\n", argv[1]);
exit(1);
}
out = fopen(argv[2], "w");
if (out == NULL) {
fprintf(stderr, "Couldn't open %s for writing.\n", argv[2]);
exit(1);
}
while (fgets(line, sizeof(line), in)) {
int len = strlen(line);
/* Strip whitespace from end */
while (len > 0 && isspace(line[len - 1])) len--;
line[len] = '\0';
if (line[0] == '>') {
/* First char is '>': copy from second char in line */
strcpy(ref, line + 1);
} else {
/* Other lines are sequences */
fprintf(out, "%s: %d\n", ref, len);
nseq++;
}
}
fprintf(out, "Total number of sequences. %d\n", nseq);
fclose(in);
fclose(out);
return 0;
}
A lot of code is about enforcing arguments and opening and closing files. (You could cut out a lot of code if you used stdin and stdout with file redirections.)
The core is the big while loop. Things to note:
fgets returns NULL on error or when the end of file is reached.
The first lines determine the length of the line and then remove white-space from the end.
It is not enough to decrement length, at the end the stripped string must be terminated with the null character '\0'
When you check the first character in the line, you should check against a char, not a string. In C, single and double quotes are not interchangeable. ">" is a string literal of two characters, '>' and the terminating '\0'.
When dealing with countable entities like chars in a string, use integer types, not floating-point numbers. (I've used (signed) int here, but because there can't be a negative number of chars in a line, it might have been better to have used an unsigned type.)
The notation line + 1 is equivalent to &line[1].
The code I've shown doesn't check that there is always one reference per sequence. I'll leave this as an exercise to the reader.
For a beginner, this can be quite a lot to keep track of. For small text-processing tasks like yours, Python and Perl are definitely better suited.
Edit: The solution above won't work for long sequences; it is restricted to MAXLEN characters. But you don't need dynamic allocation if you only need the length, not the contents of the sequences.
Here's an updated version that doesn't read lines, but reads characters instead. In '>' context it stores the reference; otherwise it just keeps a count:
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h> /* for isspace() */
#define MAXLEN 80 /* Maximum line length, including null terminator */
int main(int argc, char *argv[])
{
FILE *in;
FILE *out;
int nseq = 0; /* Sequence counter */
char ref[MAXLEN]; /* Reference name */
in = fopen(argv[1], "r");
out = fopen(argv[2], "w");
/* Snip: Argument and file checking as above */
while (1) {
int c = getc(in);
if (c == EOF) break;
if (c == '>') {
int n = 0;
c = fgetc(in);
while (c != EOF && c != '\n') {
if (n < sizeof(ref) - 1) ref[n++] = c;
c = fgetc(in);
}
ref[n] = '\0';
} else {
int len = 0;
int n = 0;
while (c != EOF && c != '\n') {
n++;
if (!isspace(c)) len = n;
c = fgetc(in);
}
fprintf(out, "%s: %d\n", ref, len);
nseq++;
}
}
fprintf(out, "Total number of sequences. %d\n", nseq);
fclose(in);
fclose(out);
return 0;
}
Notes:
fgetc reads a single byte from a file and returns this byte or EOF when the file has ended. In this implementation, that's the only reading function used.
Storing a reference string is implemented via fgetc here too. You could probably use fgets after skipping the initial angle bracket, too.
The counting just reads bytes without storing them. n is the total count, len is the count up to the last non-space. (Your lines probably consist only of ACGT without any trailing space, so you could skip the test for space and use n instead of len.)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[]){
FILE *my_file = fopen(argv[1], "r");
FILE *my_output = fopen(argv[2], "w");
int total_sequence_counts = 0;
char *sequence_name;
int dna_length;
char *line = NULL;
size_t size = 0;
while(-1 != getline(&line, &size, my_file)){
if(line[0] == '>'){
sequence_name = strdup(strtok(line, ">\n"));
total_sequence_counts += 1;
continue;
}
dna_length = strlen(strtok(line, "\n"));
fprintf(my_output, "%s %d\n", sequence_name, dna_length);
free(sequence_name);
}
fprintf(my_output, "Total number of sequences = %d\n", total_sequence_coutns);
fclose(my_file);
fclose(my_output);
free(line);
return (0);
}
I am trying to read UTF-8 text from a text file and then print some of it to another file. I am using Linux and the gcc compiler. This is the code I am using:
#include <stdio.h>
#include <stdlib.h>
int main(){
FILE *fin;
FILE *fout;
int character;
fin=fopen("in.txt", "r");
fout=fopen("out.txt","w");
while((character=fgetc(fin))!=EOF){
putchar(character); // It displays the right character (UTF8) in the terminal
fprintf(fout,"%c ",character); // It displays weird characters in the file
}
fclose(fin);
fclose(fout);
printf("\nFile has been created...\n");
return 0;
}
It works for English characters for now.
Instead of
fprintf(fout,"%c ",character);
use
fprintf(fout,"%c",character);
The second fprintf() does not contain a space after %c, which is what was causing out.txt to display weird characters. The reason is that fgetc() retrieves a single byte (the same thing as an ASCII character), not a whole UTF-8 character. Since UTF-8 is ASCII compatible, it will write English characters to the file just fine.
putchar(character) outputs the bytes sequentially without an extra space between them, so the original UTF-8 sequence remains intact. To see what I'm talking about, try
while((character=fgetc(fin))!=EOF){
putchar(character);
printf(" "); // This mimics what you are doing when you write to out.txt
fprintf(fout,"%c ",character);
}
If you want to write UTF-8 characters with the space between them to out.txt, you would need to handle the variable length encoding of a UTF-8 character.
#include <stdio.h>
#include <stdlib.h>
/* The first byte of a UTF-8 character
* indicates how many bytes are in
* the character, so only check that
*/
int numberOfBytesInChar(unsigned char val) {
if (val < 128) {
return 1;
} else if (val < 224) {
return 2;
} else if (val < 240) {
return 3;
} else {
return 4;
}
}
int main(){
FILE *fin;
FILE *fout;
int character;
fin = fopen("in.txt", "r");
fout = fopen("out.txt","w");
while( (character = fgetc(fin)) != EOF) {
for (int i = 0; i < numberOfBytesInChar((unsigned char)character) - 1; i++) {
putchar(character);
fprintf(fout, "%c", character);
character = fgetc(fin);
}
putchar(character);
printf(" ");
fprintf(fout, "%c ", character);
}
fclose(fin);
fclose(fout);
printf("\nFile has been created...\n");
return 0;
}
This code worked for me:
/* fgetwc example */
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
int main ()
{
setlocale(LC_ALL, "en_US.UTF-8");
FILE * fin;
FILE * fout;
wint_t wc;
fin=fopen ("in.txt","r");
fout=fopen("out.txt","w");
while((wc=fgetwc(fin))!=WEOF){
// work with: "wc"
}
fclose(fin);
fclose(fout);
printf("File has been created...\n");
return 0;
}
If you do not wish to use the wide options, experiment with the following:
Read and write bytes, not characters.
In other words, use binary, not text.
fgetc effectively gets a byte from a file, but if the byte is greater than 127, treat it as an int instead of a char.
fputc, on the other hand, silently ignores putting a char > 127. It will work if you use an int rather than a char as the input.
Also, for the open mode, try binary: rb and wb rather than r and w.
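A sketch of that byte-for-byte approach (the file names follow the question's in.txt / out.txt):
#include <stdio.h>

int main(void)
{
    /* Binary mode, and an int to hold fgetc's result so EOF stays distinguishable */
    FILE *fin  = fopen("in.txt",  "rb");
    FILE *fout = fopen("out.txt", "wb");
    int byte;

    if (!fin || !fout)
        return 1;

    while ((byte = fgetc(fin)) != EOF)
        fputc(byte, fout);   /* copy each byte unchanged */

    fclose(fin);
    fclose(fout);
    return 0;
}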
The C-style solution is very insightful, but if you'd consider using C++ the task becomes much more high-level and it does not require so much knowledge about the UTF-8 encoding. Consider the following:
#include <iostream>
#include <fstream>
#include <locale>

using namespace std;

int main(){
    wifstream input { "in.txt" };
    wofstream output { "out.txt" };

    // Look out - this part is not portable to Windows
    locale utf8 {"en_US.UTF-8"};
    input.imbue(utf8);
    output.imbue(utf8);
    wcout.imbue(utf8);

    wchar_t c;
    while(input >> noskipws >> c) {
        wcout << c;
        output << c;
    }
    return 0;
}
What's the best way to read a file backwards in C? I know at first you may be thinking that this is no use whatsoever, but most logs etc. append the most recent data at the end of the file. I want to read in text from the file backwards, buffering it into lines - that is
abc
def
ghi
should read ghi, def, abc in lines.
So far I have tried:
#include <stdio.h>
#include <stdlib.h>
void read_file(FILE *fileptr)
{
char currentchar = '\0';
int size = 0;
while( currentchar != '\n' )
{
currentchar = fgetc(fileptr); printf("%c\n", currentchar);
fseek(fileptr, -2, SEEK_CUR);
if( currentchar == '\n') { fseek(fileptr, -2, SEEK_CUR); break; }
else size++;
}
char buffer[size]; fread(buffer, 1, size, fileptr);
printf("Length: %d chars\n", size);
printf("Buffer: %s\n", buffer);
}
int main(int argc, char *argv[])
{
if( argc < 2) { printf("Usage: backwards [filename]\n"); return 1; }
FILE *fileptr = fopen(argv[1], "rb");
if( fileptr == NULL ) { perror("Error:"); return 1; }
fseek(fileptr, -1, SEEK_END); /* Seek to END of the file just before EOF */
read_file(fileptr);
return 0;
}
This is an attempt to simply read one line and buffer it. Sorry that my code is terrible; I am getting very confused. I know that you would normally allocate memory for the whole file and then read in the data, but for large files that constantly change I thought it would be better to read directly (especially if I want to search for text in a file).
Thanks in advance
* Sorry, I forgot to mention this will be used on Linux, so newlines are just NL without CR. *
You could just pipe the input through the program tac, which is like cat but backwards!
http://linux.die.net/man/1/tac
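If you want to consume tac's output from inside a C program rather than via a shell pipeline, a sketch using popen() (POSIX; the log file name is just an example):
#include <stdio.h>

int main(void)
{
    char line[1024];
    /* Let tac reverse the line order; read its output like a normal stream */
    FILE *p = popen("tac /var/log/syslog", "r");
    if (!p)
        return 1;

    while (fgets(line, sizeof line, p))
        fputs(line, stdout);   /* lines arrive last-to-first */

    pclose(p);
    return 0;
}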
I recommend a more portable (hopefully) way of file size determination since fseek(binaryStream, offset, SEEK_END) is not guaranteed to work. See the code below.
I believe that files should be at least minimally buffered at the kernel level (e.g. buffering at least one block per file by default), so seeks should not incur significant amount of extra I/O and should only advance the file position internally. If the default buffering is not satisfactory, you may try to use setvbuf() to speed up the I/O.
#include <limits.h>
#include <string.h>
#include <stdio.h>
/* File must be open with 'b' in the mode parameter to fopen() */
long fsize(FILE* binaryStream)
{
long ofs, ofs2;
int result;
if (fseek(binaryStream, 0, SEEK_SET) != 0 ||
fgetc(binaryStream) == EOF)
return 0;
ofs = 1;
while ((result = fseek(binaryStream, ofs, SEEK_SET)) == 0 &&
(result = (fgetc(binaryStream) == EOF)) == 0 &&
ofs <= LONG_MAX / 4 + 1)
ofs *= 2;
/* If the last seek failed, back up to the last successfully seekable offset */
if (result != 0)
ofs /= 2;
for (ofs2 = ofs / 2; ofs2 != 0; ofs2 /= 2)
if (fseek(binaryStream, ofs + ofs2, SEEK_SET) == 0 &&
fgetc(binaryStream) != EOF)
ofs += ofs2;
/* Return -1 for files longer than LONG_MAX */
if (ofs == LONG_MAX)
return -1;
return ofs + 1;
}
/* File must be open with 'b' in the mode parameter to fopen() */
/* Set file position to size of file before reading last line of file */
char* fgetsr(char* buf, int n, FILE* binaryStream)
{
long fpos;
int cpos;
int first = 1;
if (n <= 1 || (fpos = ftell(binaryStream)) == -1 || fpos == 0)
return NULL;
cpos = n - 1;
buf[cpos] = '\0';
for (;;)
{
int c;
if (fseek(binaryStream, --fpos, SEEK_SET) != 0 ||
(c = fgetc(binaryStream)) == EOF)
return NULL;
if (c == '\n' && first == 0) /* accept at most one '\n' */
break;
first = 0;
if (c != '\r') /* ignore DOS/Windows '\r' */
{
unsigned char ch = c;
if (cpos == 0)
{
memmove(buf + 1, buf, n - 2);
++cpos;
}
memcpy(buf + --cpos, &ch, 1);
}
if (fpos == 0)
{
fseek(binaryStream, 0, SEEK_SET);
break;
}
}
memmove(buf, buf + cpos, n - cpos);
return buf;
}
int main(int argc, char* argv[])
{
FILE* f;
long sz;
if (argc < 2)
{
printf("filename parameter required\n");
return -1;
}
if ((f = fopen(argv[1], "rb")) == NULL)
{
printf("failed to open file \'%s\'\n", argv[1]);
return -1;
}
sz = fsize(f);
// printf("file size: %ld\n", sz);
if (sz > 0)
{
char buf[256];
fseek(f, sz, SEEK_SET);
while (fgetsr(buf, sizeof(buf), f) != NULL)
printf("%s", buf);
}
fclose(f);
return 0;
}
I've only tested this on Windows with 2 different compilers.
There are quite a few ways you could do this, but reading a byte at a time is definitely one of the poorer choices.
Reading the last, say, 4KB and then walking back up from the last character to the previous newline would be my choice.
Another option is to mmap the file, and just pretend that the file is a lump of memory, and scan backwards in that. [You can tell mmap you are reading backwards too, to make it prefetch data for you].
If the file is VERY large (several gigabytes), you may want to only use a small portion of the file in mmap.
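A rough sketch of the mmap approach on Linux (error handling trimmed; it assumes the file ends with a newline and simply prints its lines in reverse order):
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0)
        return 1;

    /* Map the whole file read-only and treat it as one big buffer */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    /* Walk backwards, emitting each line when the '\n' before it is found */
    off_t end = st.st_size;
    for (off_t i = st.st_size - 1; i >= 0; i--) {
        if (data[i] == '\n' && i + 1 < end) {
            fwrite(data + i + 1, 1, end - (i + 1), stdout);
            end = i + 1;
        }
    }
    fwrite(data, 1, end, stdout);   /* first line has no preceding '\n' */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}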
If you want to learn how it's done, here's a Debian/Ubuntu example (for others, like RPM-based distros, adapt as needed):
~$ which tac
/usr/bin/tac
~$ dpkg -S /usr/bin/tac
coreutils: /usr/bin/tac
~$ mkdir srcs
~$ cd srcs
~/srcs$ apt-get source coreutils
(clip apt-get output)
~/srcs$ ls
coreutils-8.13 coreutils_8.13-3.2ubuntu2.1.diff.gz coreutils_8.13-3.2ubuntu2.1.dsc coreutils_8.13.orig.tar.gz
~/srcs$ cd coreutils-8.13/
~/srcs/coreutils-8.13$ find . -name tac.c
./src/tac.c
~/srcs/coreutils-8.13$ less src/tac.c
That's not too long, a bit over 600 lines, and while it packs some advanced features and uses functions from other sources, the reverse line-buffering implementation appears to be in that tac.c source file.
FSEEKing for every byte sounds PAINFULLY slow.
If you've got the memory, just read the entire file into memory and either reverse it or scan it backwards.
Another option would be Windows memory mapped files.