How to Read/Write UTF-8 text files in C?

I am trying to read UTF-8 text from a text file and then print some of it to another file. I am using Linux and the gcc compiler. This is the code I am using:
#include <stdio.h>
#include <stdlib.h>
int main(){
FILE *fin;
FILE *fout;
int character;
fin=fopen("in.txt", "r");
fout=fopen("out.txt","w");
while((character=fgetc(fin))!=EOF){
putchar(character); // It displays the right character (UTF8) in the terminal
fprintf(fout,"%c ",character); // It displays weird characters in the file
}
fclose(fin);
fclose(fout);
printf("\nFile has been created...\n");
return 0;
}
It works for English characters for now.

Instead of
fprintf(fout,"%c ",character);
use
fprintf(fout,"%c",character);
The second fprintf() has no space after %c; that space was what caused out.txt to display weird characters. fgetc() retrieves a single byte, not a whole UTF-8 character, and a non-ASCII UTF-8 character occupies two to four bytes. Writing a space after every byte splits those multi-byte sequences apart and leaves invalid UTF-8 in the file. English characters survive because UTF-8 is ASCII-compatible: each of them is exactly one byte.
putchar(character) outputs the bytes sequentially, without the extra space between them, so the original UTF-8 sequence remains intact. To see the same effect in the terminal, try
while((character=fgetc(fin))!=EOF){
putchar(character);
printf(" "); // This mimics what you are doing when you write to out.txt
fprintf(fout,"%c ",character);
}
If you want to write UTF-8 characters with the space between them to out.txt, you would need to handle the variable length encoding of a UTF-8 character.
#include <stdio.h>
#include <stdlib.h>
/* The first byte of a UTF-8 character
* indicates how many bytes are in
* the character, so only check that
*/
int numberOfBytesInChar(unsigned char val) {
if (val < 128) {
return 1;
} else if (val < 224) {
return 2;
} else if (val < 240) {
return 3;
} else {
return 4;
}
}
int main(){
FILE *fin;
FILE *fout;
int character;
fin = fopen("in.txt", "r");
fout = fopen("out.txt","w");
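while( (character = fgetc(fin)) != EOF) {
/* Work out the length once, from the lead byte only; the continuation
 * bytes read below must not change the loop bound */
int extra = numberOfBytesInChar((unsigned char)character) - 1;
for (int i = 0; i < extra; i++) {
putchar(character);
fprintf(fout, "%c", character);
character = fgetc(fin);
if (character == EOF) break; /* file ended in the middle of a character */
}
if (character == EOF) break;
putchar(character);
printf(" ");
fprintf(fout, "%c ", character);
}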
fclose(fin);
fclose(fout);
printf("\nFile has been created...\n");
return 0;
}

This code worked for me:
/* fgetwc example */
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
int main ()
{
setlocale(LC_ALL, "en_US.UTF-8");
FILE * fin;
FILE * fout;
wint_t wc;
fin=fopen ("in.txt","r");
fout=fopen("out.txt","w");
while((wc=fgetwc(fin))!=WEOF){
fputwc(wc, fout); // work with "wc" here; this line copies it to out.txt
}
fclose(fin);
fclose(fout);
printf("File has been created...\n");
return 0;
}

If you do not wish to use the wide options, experiment with the following:
Read and write bytes, not characters.
Also known as, use binary, not text.
fgetc reads one byte from the file and returns it as an int (the unsigned char value, or EOF), so store the result in an int rather than a char. With a char, bytes above 127 may turn negative, and 0xFF can be mistaken for EOF.
fputc, likewise, takes an int and writes it as an unsigned char, so you can pass it the int that fgetc returned, unchanged.
Also, open the files in binary mode: use rb and wb rather than r and w.

The C-style solution is very insightful, but if you'd consider using C++ the task becomes much more high level and it does not require you to have so much knowledge about utf-8 encoding. Consider the following:
#include <iostream>
#include <fstream>
#include <locale>
using namespace std;
int main(){
wifstream input { "in.txt" };
wofstream output { "out.txt" };
// Look out - this part is not portable to windows
locale utf8 {"en_US.UTF-8"};
input.imbue(utf8);
output.imbue(utf8);
wcout.imbue(utf8);
wchar_t c;
while(input >> noskipws >> c) {
wcout << c;
output << c;
}
return 0;
}

Related

Counting Character usage in text file? C

Hi,
I need to count the usage of alphabetical characters in a plain text file. This is what I have come up with. Basically, it runs through the text file and compares each character with the ASCII value of the character being searched for.
When I run it, all I can see is the first printf() string, and then an error about the terminated status when I close the console.
I do have a text.txt file in the same folder as the .exe file, but I can't see any counts.
I'm not sure whether just my syntax is bad or the semantics as well.
Thx for help! :-)
#include <stdio.h>
#include <stdlib.h>
#define ASCIIstart 65
#define ASCIIend 90
void main(){
FILE *fopen(), *fp;
int c;
unsigned int sum;
fp = fopen("text.txt","r");
printf("Characters found in text: \n");
for (int i = ASCIIstart; i <= ASCIIend; i++){
sum = 0;
c = toupper(getc(fp));
while (c != EOF){
if (c == i){
sum = sum++;
}
c = toupper(getc(fp));
}
if (sum > 0){
printf("%c: %u\n",i,sum);
}
}
fclose(fp);
}
Instead of looking up the entire file for each character, you could do
FILE *fp;
int c, sum[ASCIIend - ASCIIstart + 1]={0};
fp = fopen("file.txt","r");
if(fp==NULL)
{
perror("Error");
return 1;
}
int i;
while( (c = toupper(getc(fp)))!= EOF)
{
if(c>=ASCIIstart && c<=ASCIIend)
{
sum[c-ASCIIstart]++;
}
}
for(i=ASCIIstart; i<=ASCIIend; ++i)
{
printf("\n%c: %d", i, sum[i-ASCIIstart]);
}
You must check the return value of fopen() to ensure that the file was successfully opened.
There's an array sum which holds the number of occurrences of each character within the range given by the ASCIIstart and ASCIIend macros.
The size of the array is just the number of characters whose number of occurrences is to be counted.
sum[c-ASCIIstart] is used because the difference between the ASCII value (if the encoding is indeed ASCII) of c and ASCIIstart would give the index associated with c.
I don't know what you meant by FILE *fopen(), *fp; but fopen() is the name of a function in C used to open files.
And by
FILE *fopen(), *fp;
you declared a function fopen() yourself.
But in stdio.h, there's already a prototype for fopen() like
FILE *fopen(const char *path, const char *mode);
yet no error was shown, because the empty parentheses in your declaration leave the parameters unspecified, which is compatible with that prototype.
Had your declaration used a return type other than FILE *, or listed conflicting parameter types such as int, you would definitely have got an error.
Also, void main() is not considered good practice. Use int main() instead.
You can use an array of counters, traverse the file contents once, and display the counts at the end.
#include <stdio.h>
#include <ctype.h>
int main(void){
FILE *fp;
int c;
fp = fopen("test.txt","r");
if (fp == NULL) {
perror("test.txt");
return 1;
}
printf("Characters found in text: \n");
int charArr[26]= {0}; /* int, so counts above 127 do not overflow */
c = toupper(fgetc(fp));
while(c!=EOF) {
if (c >= 'A' && c <= 'Z') /* skip non-letters to stay inside the array */
charArr[c-'A']=charArr[c-'A']+1;
c = toupper(fgetc(fp));
}
fclose(fp);
for(int i=0;i<26;i++){
printf("\nChar: %c | Count= %d ",i+65,charArr[i]);
}
return 0;
}
Hope this helps!!
Because after the first pass you are at the end of the file,
and your c = toupper(getc(fp)); returns EOF (typically -1) from then on.
For counting just one character, you are reading the whole file and repeating this for each and every character. Instead, you can do:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#define ASCIIstart 65
#define ASCIIend 90
int main(){
FILE *fp;
int c, i;
int alphabets[26] = {0};
fp = fopen("text.txt","r");
if (fp == NULL){
fprintf (stderr, "Failed to open file\n");
return -1;
}
while ((c = toupper(fgetc(fp))) != EOF){
if (c >= ASCIIstart && c <= ASCIIend)
alphabets[c - ASCIIstart]++;
}
fclose(fp);
fprintf(stdout, "Characters found in text: \n");
for (i = 0; i < 26; i++)
fprintf (stdout, "%c: %d\n", i+ASCIIstart, alphabets[i]);
return 0;
}
TLDR
Working with your code, your loops are inside-out.
I'll answer in pseudo-code to keep the concepts straightforward.
Right now you are doing this:
FOR LETTER = 'A' TO 'Z':
WHILE FILE HAS CHARACTERS
GET NEXT CHARACTER
IF CHARACTER == LETTER
ADD TO COUNT FOR CHAR
END IF
END WHILE
END FOR
The problem is you are running through the file with character 'A' and then reaching the end of file so nothing gets done for 'B'...'Z'
If you swapped this:
WHILE FILE HAS CHARACTERS
GET NEXT CHARACTER
FOR LETTER = 'A' TO 'Z'
IF LETTER = UCASE(CHARACTER)
ADD TO COUNT FOR LETTER
END IF
END FOR
END WHILE
Obviously, doing 26 checks for each character is wasteful, so a better approach is to index an array of counts directly:
LET COUNTS = ARRAY(26)
WHILE FILE HAS CHARACTERS
GET NEXT CHARACTER
CHARACTER := UCASE(CHARACTER)
IF CHARACTER >= 'A' AND CHARACTER <= 'Z'
LET INDEX = CHARACTER - 'A'
COUNTS[INDEX]++
ENDIF
END WHILE
You can translate the pseudo code to C as an exercise.
Rewind the pointer to the beginning of the file at the end of your for loop?
This has been posted before: Resetting pointer to the start of file
P.S. - maybe use an array for your output values, e.g. int charactercount[UCHAR_MAX + 1] (UCHAR_MAX comes from <limits.h>), with one counter per possible byte value, so that you don't have to parse the file repeatedly?

Reading and writing into a file in c

I need to write some strings into a file in uppercase, then display them on screen in lowercase. After that, I need to write the new (lowercase) text into the file. I wrote some code, but it doesn't work. When I run it, my file seems to be untouched and the conversion to lowercase doesn't work.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
void main(void) {
int i;
char date;
char text[100];
FILE *file;
FILE *file1;
file = fopen("C:\\Users\\amzar\\Desktop\\PC\\Pregatire PC\\Pregatire PC\\file\\da.txt","r");
file1 = fopen("C:\\Users\\amzar\\Desktop\\PC\\Pregatire PC\\Pregatire PC\\file\\da.txt","w");
printf("\nSe citeste fisierul si se copiaza textul:\n ");
if(file) {
while ((date = getc(file)) != EOF) {
putchar(tolower(date));
for (i=0;i<27;i++) {
strcpy(text[i],date);
}
}
}
if (file1) {
for (i=0;i<27;i++)
fprintf(file1,"%c",text[i]);
}
}
There are several problems with your program.
First, getc() returns int, not char. This is necessary so that it can hold EOF, as this is not a valid char value. So you need to declare date as int.
When you fix this, you'll notice that the program ends immediately, because of the second problem. This is because you're using the same file for input and output. When you open the file in write mode, that empties the file, so there's nothing to read. You should wait until after you finish reading the file before you open it for output.
The third problem is this line:
strcpy(text[i],date);
The arguments to strcpy() must be strings, i.e. pointers to null-terminated arrays of char, but text[i] and date are char (single characters). Make sure you have compiler warnings enabled -- that line should have warned you about the incorrect argument types. To copy single characters, just use ordinary assignment:
text[i] = date;
But I'm not really sure what you intend with that loop that copies date into every text[i]. I suspect you want to copy each character you read into the next element of text, not into all of them.
Finally, when you were saving into text, you didn't save the lowercase version.
Here's a corrected program. I've also added a null terminator to text, and changed the second loop to check for that, instead of hard-coding the length 27.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
void main(void) {
int i = 0;
int date;
char text[100];
FILE *file;
FILE *file1;
file = fopen("C:\\Users\\amzar\\Desktop\\PC\\Pregatire PC\\Pregatire PC\\file\\da.txt","r");
printf("\nSe citeste fisierul si se copiaza textul:\n ");
if(file) {
while (i < (int)sizeof(text) - 1 && (date = getc(file)) != EOF) { /* leave room for the terminator */
putchar(tolower(date));
text[i++] = tolower(date);
}
text[i] = '\0';
fclose(file);
} else {
printf("Can't open input file\n");
exit(1);
}
file1 = fopen("C:\\Users\\amzar\\Desktop\\PC\\Pregatire PC\\Pregatire PC\\file\\da.txt","w");
if (file1) {
for (i=0;text[i] != '\0';i++)
fprintf(file1,"%c",text[i]);
fclose(file1);
} else {
printf("Can't open output file\n");
exit(1);
}
}

Converting symbols to lowercase letters in c

I'm writing a program that encrypts a file using XOR and prints the encrypted text into another file. It technically works; however, the output contains several symbols rather than only lowercase characters. How would I tell the program to print only lowercase letters and still be able to decode the output?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *args[]){
FILE *inFile, *outFile, *keyFile;
int key_count = 0;
int encrypt_byte;
char key[1000];
inFile = fopen("input.txt", "r");
outFile = fopen("output.txt", "w");
keyFile = fopen("key.txt", "r");
while((encrypt_byte = fgetc(inFile)) !=EOF)
{
fputc(encrypt_byte ^ key[key_count], outFile); //XORs
key_count++;
if(key_count == strlen(key)) //Reset the counter
key_count = 0;
}
printf("Complete!");
fclose(inFile);
fclose(outFile);
fclose(keyFile);
return(0);
}
Here is the output I get:
ÕââÐå朶è”ó
I just want it to only use lowercase letters
You can't. You either XOR all the data in the file, or you don't, and XOR-ing can produce any byte value, including non-printable characters.
What you can do, though, is XOR it first and then encode the result as base64.
To get your original text/data back, do the reverse.
See also How do I base64 encode (decode) in C?
Use the function tolower().
Here is an example:
#include<stdio.h>
#include<ctype.h>
int main()
{
int counter=0;
char mychar;
char str[]="TeSt THis seNTeNce.\n";
while (str[counter])
{
mychar=str[counter];
putchar (tolower((unsigned char)mychar)); /* tolower() expects an unsigned char value or EOF */
counter++;
}
return 0;
}

Conversion of Image--->binary--->image using C

We are trying to convert an image into binary data and vice-versa for a project using C programming. All the other solutions we found on the net are either in C++ or Java. Here is the approach we tried:
Convert the image into a text file containing binary data. Each 8 characters corresponds to the character byte when the image is opened using a text editor.
Then we try to reconvert the binary data into its respective characters using a C program.
Then we open the result using Picasa Photoviewer. We get an invalid image.
How do we get back the original image? Here is the code we used to convert the image into a text file:
#include<stdio.h>
#include<conio.h>
void main()
{
clrscr();
FILE *fptr;
FILE *txt;
int c;
fptr=fopen("D:\\aa.bmp","r");
txt=fopen("D:\\test1.txt","w");
if(fptr==NULL)
{
printf("NOTHING In FILE");
fclose(fptr);
}
else
{
printf("success");
do
{
c=fgetc(fptr);
for(int i=0;i<=7;i++)
{
if(c&(1<<(7-i)))
{
fputc('1',txt);
}
else
{
fputc('0',txt);
}
}
// fprintf(txt,"\t");
}while(c!=EOF);
}
fclose(fptr);
fclose(txt);
printf("writing over");
getch();
}
Here is the code to convert the resulting text file (full of binary characters, i.e. a text file with only ones and zeroes) back into an image file.
#include<stdio.h>
#include<conio.h>
\\The following function converts the ones and zeroes in the text file into a character.
\\For example the text file may have the 8 consecutive characters '1','0','0','0','1','0','0','0'.
\\This converts it into the character equivalent of the binary \\value 10001000
char bytefromtext(char* text)
{
char result=0;
for(int i=0;i<8;i++)
{
if(text[i]=='1')
{
result |= (1<< (7-i) );
}
}
return result;
}
void main()
{
clrscr();
FILE *pfile;
FILE *image;
char buf[8];
char c;
int j=0;
image=fopen("D:\\aa2.bmp","w"); //open an empty .bmp file to
//write characters from the source image file
pfile=fopen("D:\\test1.txt","r");
if(pfile==NULL)
printf("error");
else
{
c=fgetc(pfile);
while(c!=EOF)
{
buf[j++]=c;
if(j==8)
{
fputc(bytefromtext(buf),image);
j=0;
}
c=fgetc(pfile);
}
fclose(pfile);
fclose(image);
}
getch();
}
We get an invalid image when the characters are written into the .bmp file. When we open this new file using a text editor and also the image file using a text editor we get the same characters.
Image to text file
fptr=fopen("D:\\aa.bmp","r");
The BMP file must be opened in binary mode ("rb") to ensure that the bytes values are read correctly. The "r" mode opens the file in text mode which may cause some characters to be converted, resulting in corrupted output.
For example, on Windows (or at least DOS), line endings will be converted from "\r\n" to "\n" and the character "\x1a" might be interpreted as an EOF indicator and truncate your input.
On UNIX-like systems, on the other hand, there is no difference.
do
{
c=fgetc(fptr);
for(int i=0;i<=7;i++)
{
/* ... */
}
// fprintf(txt,"\t");
}while(c!=EOF);
This loop is completely wrong. You need to check for EOF at the top of the loop. When fgetc() returns EOF, your code will take the EOF value (typically -1), and output the corresponding ones and zeroes, before exiting the loop. This will also corrupt your output.
Instead, you should do something like this:
while ((c = fgetc (fptr)) != EOF)
{
/* ... */
}
If you're uncomfortable with assignment and comparison in the same expression, there is a fix for that:
while (1)
{
c = fgetc (fptr);
if (c == EOF)
break;
/* ... */
}
Note also that fgetc() also returns EOF on error. You should test for this (if (ferror (fptr))) and report the problem to the user.
fclose(fptr);
fclose(txt);
You should check the return value of fclose() and report any error back to the user, at least on the output stream. On some file systems, the last output will not be written to disc until the stream is closed, and any error writing it will be reported by fclose(). See "What are the reasons to check for error on close()?" for an illuminating tale of what can happen when you don't.
Text file to image
image=fopen("D:\\aa2.bmp","w"); //open an empty .bmp file to
You must use binary mode ("wb") as explained above.
char bytefromtext(char* text)
{
char result=0;
for(int i=0;i<8;i++)
{
if(text[i]=='1')
{
result |= (1<< (7-i) );
}
}
return result;
}
You should use unsigned char when dealing with binary data. Plain char is either signed or unsigned at the choice of the implementor (eg. compiler vendor).
If the value stored in result cannot be represented in a signed char (such as 1<<7), the result is implementation defined. This could theoretically corrupt your output. Although I think your code will probably work as you intended in most cases, you should still use unsigned char as a matter of principle.
(This assumes, of course, that char is not larger than 8 bits which is usually the case.)
char c;
/* ... */
c=fgetc(pfile);
while(c!=EOF)
{
/* ... */
c=fgetc(pfile);
}
This loop is wrong for another reason. If plain char happens to be unsigned, c will never compare equal to EOF which always has a negative value. You should use an int variable, test against EOF, and only then use the value as a character value.
fclose(pfile);
fclose(image);
You should check the return values as mentioned above.
Other issues
I also have a couple of other quibbles with your code.
In C, main() always returns int, and you should return an appropriate value to indicate success or failure. (This does not apply to a freestanding environment, eg. a C program running without an operating system.)
The comment section in the second program has backslashes instead of forward slashes. When posting code, you should always copy/paste it to avoid introducing new errors.
Check your write mode for image=fopen("D:\\aa2.bmp","w"); it's not binary. Open the file with "wb".
This code works well. Tried on a Raspberry Pi 3 with gcc.
bmp to text
#include <stdio.h>
int main(int argc, char*argv[]){
FILE *ptr_bmp_in;
FILE *ptr_text_out;
int c;
ptr_bmp_in=fopen("panda_input.bmp","rb");
ptr_text_out=fopen("panda_to_text.txt","w");
if(!ptr_bmp_in)
{
printf("Unable to open file\n");
return 1;
}
while((c=fgetc(ptr_bmp_in)) != EOF)
{
for(int i=0;i<=7;i++)
{
if(c&(1<<(7-i)))
{
fputc('1',ptr_text_out);
}
else
{
fputc('0',ptr_text_out);
}
}
}
fclose(ptr_bmp_in);
fclose(ptr_text_out);
printf("Writing done\n");
return 0;
}
and text to bmp
#include <stdio.h>
unsigned char bytefromtext(unsigned char* text)
{
unsigned char result = 0; /* unsigned, so setting bit 7 is well defined */
for(int i=0;i<8;i++)
{
if(text[i]=='1')
{
result |= (1 << (7-i));
}
}
return result;
}
int main(int argc, char*argv[]){
FILE *ptr_txt_in;
FILE *ptr_bmp_out;
unsigned char buf[8];
int c;
int j = 0;
ptr_txt_in=fopen("panda_to_text.txt","r");
ptr_bmp_out=fopen("panda_output.bmp","wb");
if(!ptr_txt_in)
{
printf("Unable to open file\n");
return 1;
}
while((c=fgetc(ptr_txt_in)) != EOF)
{
buf[j++] = c;
if(j==8)
{
fputc(bytefromtext(buf),ptr_bmp_out);
j=0;
}
}
fclose(ptr_txt_in);
fclose(ptr_bmp_out);
printf("Writing done\n");
return 0;
}

Opening binary files in C

I'm trying to open a binary file and read the contents for a class assignment. Even after doing research, I'm having trouble getting anything to appear when I attempt to open a binary file and print its contents. I'm not even sure what I should be getting or how to check that it's right, but I know that nothing (which is what I'm currently getting) is bad. Here's the code I got from searching on this site:
#include<stdio.h>
int main()
{
FILE *ptr_myfile;
char buf[8];
ptr_myfile = fopen("packets.1","rb");
if (!ptr_myfile)
{
printf("Unable to open file!");
return 1;
}
fread(buf, 1, 8, ptr_myfile);
printf("First Character: %c", buf[0]);
fclose(ptr_myfile);
return 0;
}
When this prints, I get "First Character: " with nothing else printed. Maybe it doesn't print normally in terminal? I'm not sure, any help would be greatly appreciated, thanks
If it's a binary file, it's very likely that its contents don't print particularly well as text (that's what makes it a binary file). Instead of printing the bytes as characters, try printing them as hexadecimal numbers:
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
int main()
{
FILE *ptr_myfile;
unsigned char buf[8]; /* unsigned, so bytes >= 0x80 are not sign-extended when printed */
ptr_myfile = fopen("packets.1","rb");
if (!ptr_myfile)
{
printf("Unable to open file!");
return 1;
}
size_t rb;
do {
rb = fread(buf, 1, 8, ptr_myfile);
if( rb ) {
size_t i;
for(i = 0; i < rb; ++i) {
printf("%02x", (unsigned int)buf[i]);
}
printf("\n");
}
} while( rb );
fclose(ptr_myfile);
return 0;
}
First, you need to check how much data you have in the buffer. fread returns the number of items read; if it is zero, buf[0] holds no valid data and must not be read.
Not all characters are printable. You can see what data you are getting if you print the character code of buf[0] rather than the character itself. Use %d for that.
size_t len = fread(buf, 1, 8, ptr_myfile);
if (len != 0) {
printf("First Character: '%c', code %d", buf[0], buf[0]);
} else {
printf("The file has no data\n");
}

Resources