Manual "encryption" only outputs 0's to file - c

My assignment (not homework, just a "try if you can do this" thing) is to use bit operations to encrypt and decrypt a .txt file.
This is the program. It successfully opens files for read/write but puts all 0's and spaces into the output.txt file instead of the expected "encrypted" text. I am guessing the issue comes from a fundamental misunderstanding of either data types or putc(). I understand that it outputs an unsigned char, but my professor said that an unsigned char is nothing but an unsigned int -- not sure if this is purely true or if it was a pedagogical simplification. Thanks a lot for any help.
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#define NUMARG 3
#define INFILEARG 1
#define OUTFILEARG 2
int main(int argc, char *argv[]){
/* Function prototypes */
unsigned int encryptDecrypt(unsigned int x, unsigned int ed);
const char *get_filename_ext(const char *filename);
FILE *finp;
FILE *foutp;
//ed for encryption/decryption choice
unsigned int ed, c;
const char *ext;
//Check for errors in argument number and file opening.
if(argc != NUMARG){
printf("You have to put the input and output files after the program name.\n");
return(1);
}
if( (finp = fopen(argv[INFILEARG], "r")) == NULL ){
printf("Couldn't open %s for reading.\n", argv[INFILEARG]);
return(1);
}
if( (foutp = fopen(argv[OUTFILEARG], "w")) == NULL){
printf("Couldn't open %s for writing.\n", argv[OUTFILEARG]);
return(1);
}
//Get and check file extension.
ext = get_filename_ext(argv[INFILEARG]);
if(strcmp(ext, "txt")){
printf("Input file is not a .txt file.\n");
return(1);
}
ext = get_filename_ext(argv[OUTFILEARG]);
if(strcmp(ext, "txt")){
printf("Output file is not a .txt file.\n");
return(1);
}
//Get command to encrypt or decrypt.
do{
printf("Enter e to encrypt, d to decrypt: ");
ed = getchar();
} while(ed != 'e' && ed != 'd');
//Send characters to output file.
while((c = getc(finp)) != EOF ){
putc(encryptDecrypt(c, ed), foutp);
}
// Close files.
if (fclose(finp) == EOF){
printf("Error closing input file.\n");
}
if (fclose(foutp) == EOF){
printf("Error closing output file.\n");
}
if ( ed == 'e'){
printf("Encrypted data written.\n");
} else {
printf("Data decrypted.\n");
}
return 0;
}
const char *get_filename_ext(const char *filename) {
const char *dot = strrchr(filename, '.');
if(!dot || dot == filename) return "";
return dot + 1;
}
unsigned int encryptDecrypt(unsigned int c, unsigned int ed){
if( ed == 'e' ){
printf("%d before operated on.\n", c);
c &= 134;
printf("%d after &134.\n", c);
c ^= 6;
printf("%d after ^6. \n", c);
c <<= 3;
printf("%d after <<3\n", c);
}
else {
c >>= 3;
c ^= 6;
c &= 134;
}
return c;
}
Output:
ZacBook:bitoperations $ cat input1.txt
Please encrypt this message.
ZacBook:bitoperations $ ./encrypt.o input1.txt output.txt
Enter e to encrypt, d to decrypt: e
80 before operated on.
0 after &134.
6 after ^6.
48 after <<3
108 before operated on.
4 after &134.
2 after ^6.
16 after <<3
[...Many more of these debug lines]
2 after &134.
4 after ^6.
32 after <<3
Encrypted data written.
ZacBook:bitoperations $ cat output.txt
00 0 00000 0 0
As you can see, the unsigned int is being operated on successfully. I believe the problem is with putc() but I have tried changing the type of c to char and int and neither have worked.

Your main problem is that &= is a lossy transformation: that is you lose data.
Ditto <<= and >>=, as both cause extreme 1 bits to be lost.
You'll have more luck sticking to XOR; at first at least. That's because x ^ y ^ y is x.
You can take putc and the rest of the file handling out of the picture by isolating the encryption/decryption logic from the data acquisition stages, and by hardcoding the inputs while you're getting things working.
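A minimal sketch of the point above, with the cipher isolated from all file I/O. The function names and the key value used in the usage note are illustrative assumptions, not part of the original program. It contrasts a reversible XOR step with the lossy `& 134` step from the question.

```c
/* XOR is its own inverse: applying the same key twice restores the byte,
 * so one function serves for both encryption and decryption. */
unsigned char xor_crypt(unsigned char byte, unsigned char key)
{
    return (unsigned char)(byte ^ key);
}

/* The masking step from the question. Only three bits survive, so many
 * different inputs collapse to the same output and cannot be recovered. */
unsigned char mask_step(unsigned char byte)
{
    return (unsigned char)(byte & 134);
}
```

For example, `xor_crypt(xor_crypt('P', 0x5A), 0x5A)` gives back `'P'`, while `mask_step('P')` and `mask_step('Q')` both yield 0. This also likely explains the contents of output.txt: the debug values 48 and 32 are the ASCII codes for '0' and the space character.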

Related

Redirecting input file to Program, only reading one command correctly

My program runs correctly. If I manually input any commands, it reads everything correctly, assigns it to the correct variables, and then outputs it to a binary file.
But when I redirect the input file on my command line, it only reads the command used for the switch; it does not read anything else correctly.
for example
a7 < a7Input.txt
Sample of Input Text
r
1023
r
4393
c
3423
Systems Programming
MWF
3
68
c
3421
Systems Programming Recitation
TR
1
68
Enter one of the following actions or press CTRL-D to exit
C - Create a new course record
R - Read an existing course record
U - Update an existing course record
D - Delete an existing course record
This is Choice R
Please Enter a Course Number to Display
courseNumber: 0
courseName: e record
courseSched: urscourseName: 0
courseSize: -1630458206
YOU PICKED C for Create
Please Enter The Details of The Course
Enter a course number:
Enter the course schedule (MWF or TR):
Enter a Course Name:
Enter the course credit hours:
Enter Number of Students Enrolled:
Course Has Been Created!
Notice how it doesn't populate anything from the file. I'm not entirely sure what is going on, since it works correctly when I input everything manually.
Here is my code for reference
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <stdlib.h>
#include <stdarg.h>
typedef struct{
char courseName[64];
char courseSched[4];
unsigned int courseHours;
unsigned int courseSize;
} COURSE;
/* files */
FILE *pfileInputFile;
FILE *pFileDirect;
void proccessInputCommands(char *pszInputFile, char *pszDirectFile, char *argv[]);
void getFileNames(int argc, char *argv[], char **ppszDirectFileName, char **ppszInputTxtFileName);
int main(int argc, char *argv[])
{
char *pszDirectFile = NULL;
char *pszInputFile = NULL;
int rc;
int iCommandType; //Type of command
/* get the filenames and comand type from the command switches */
getFileNames(argc, argv, &pszDirectFile, &pszInputFile);
proccessInputCommands(pszInputFile, pszDirectFile, argv);
return 0;
}
void proccessInputCommands(char *pszInputFile, char *pszDirectFile, char *argv[])
{
COURSE course;
char szInputBuffer[100];
char cCommand;
long lRBA;
long lCourseNum;
char szRemaining[100];
int iScanfCnt;
int iWriteNew;
int rcFseek;
int rc;
// open the txt file for read
pfileInputFile = fopen(pszInputFile, "r");
if(pfileInputFile == NULL)
printf("ERROR\n");
// open the new direct data file for write binary
// if it already exists we simply update it
pFileDirect = fopen(pszDirectFile, "wb+");
if(pFileDirect == NULL)
printf("ERROR\n");
//get commands until EOF
//fgets returns null when EOF is reached
while(fgets(szInputBuffer, 100, pfileInputFile) != NULL)
{
if(szInputBuffer[0] == '\n')
continue;
printf("> %s",szInputBuffer);
iScanfCnt = sscanf(szInputBuffer, "%c\n %ld\n %99[^\n]\n"
, &cCommand
, &lCourseNum
, szRemaining);
if(iScanfCnt < 2)
{
printf("Error: Expected command and Course Number, found: %s\n"
, szInputBuffer);
continue;
}
switch(cCommand)
{
case 'c':
case 'C':
iScanfCnt = sscanf(szRemaining, "%64[^\n] %4s %d %d"
, course.courseName
, course.courseSched
, &course.courseHours
, &course.courseSize);
// check for bad input.
if(iScanfCnt < 4)
printf("ERROR\n");
lRBA = lCourseNum*sizeof(COURSE);
rcFseek = fseek(pFileDirect, lRBA, SEEK_SET);
assert(rcFseek == 0);
// write it to the direct file
iWriteNew = fwrite(&course
, sizeof(COURSE)
, 1L
, pFileDirect);
assert(iWriteNew == 1);
break;
case 'r':
case 'R':
lRBA = lCourseNum*sizeof(COURSE);
rcFseek = fseek(pFileDirect, lRBA, SEEK_SET);
assert(rcFseek == 0);
//print the informatio at the RBA
rc = fread(&course, sizeof(course), 1L, pFileDirect);
if(rc == 1)
printf("%-64s %-4s %5d %5d\n"
, course.courseName
, course.courseSched
, course.courseHours
, course.courseSize);
else
printf("Course Number %ld not found for CBA %ld\n",lCourseNum, lRBA);
break;
default:
printf("unknown command\n");
}
}
//close the file
fclose(pfileInputFile);
fclose(pFileDirect);
}
void getFileNames(int argc, char *argv[], char **ppszDirectFileName, char **ppszInputTxtFileName)
{
int i;
// If there aren't any arguments, show the usage
if (argc <= 1)
printf("No Arguments \n");
// Examine each of the command arguments other than the name of the program.
for (i = 1; i < argc; i++)
{
if(i == 1)
{
*ppszInputTxtFileName = argv[i];
}
if(i == 2)
{
*ppszDirectFileName = argv[i];
}
}
}

Converting Lower Case Characters in a Text File to Upper Case Letters and Vice Versa

I have a text file that contains the following text, without a new line character...
Hello World
I would like to convert the lower case characters to upper case, and vice versa, so that the same text file would end up with the following text...
hELLO wORLD
Unfortunately, when I run my code, it goes into an endless loop. When I step through the code, I see that fseek() goes back one byte for the first loop, as expected, but it goes back two bytes for the second and subsequent loops. I don't understand why it goes back two bytes instead of one. Why is this the case? Can someone please help?
Here's my code...
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
int main()
{
FILE *fp;
int ch;
long offset;
fp = fopen("c:\\users\\domenic\\desktop\\test.txt", "r+");
if (fp == NULL)
{
printf("error: unable to open file\n");
exit(1);
}
offset = ftell(fp);
while (1)
{
ch = fgetc(fp);
if (ch == EOF)
break;
if (isupper(ch))
{
fseek(fp, offset, SEEK_SET);
fputc(tolower(ch), fp);
}
else if (islower(ch))
{
fseek(fp, offset, SEEK_SET);
fputc(toupper(ch), fp);
}
offset = ftell(fp);
}
fclose(fp);
return 0;
}
If I understand you just want to change upper to lower and lower to upper throughout a file, you may be making it a bit harder on yourself than it need be.
Before we look at an approach to make things a bit easier, let's talk about avoiding magic numbers and hardcoded paths within your code. C provides a definition for main that allows you to provide arguments to your code to avoid hardcoding values (such as file/path names) -- use them. The proper invocation of main with arguments is:
int main (int argc, char *argv[])
(or you will see the equivalent int main (int argc, char **argv))
The invocation without arguments is int main (void).
Now on to the question at hand. As mentioned in my comment, when dealing with ASCII, the bit that controls the case is the 6th bit, and from the discussion, if you are dealing with EBCDIC, the case bit is the 7th bit. As @chux pointed out, both can be handled seamlessly by determining the appropriate bit with 'A' ^ 'a' (the result is 32, i.e. (1 << 5), for ASCII, and 64, i.e. (1 << 6), for EBCDIC). To toggle the case you simply XOR the case bit with the current character (A-Za-z). So for any character c whose case you wish to toggle, you simply XOR it with 'A' ^ 'a', e.g.
if (('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'))
c ^= 'A' ^ 'a';
If c was uppercase, it's now lowercase, and vice-versa.
To do that for an entire file, taking the filename to convert as the first argument to the program (or reading from stdin by default if no argument is given) and writing the case-converted result to stdout, you could do something as simple as the following:
#include <stdio.h>
int main (int argc, char **argv) {
int c;
FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
if (!fp) { /* validate file open for reading */
fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
return 1;
}
while ((c = fgetc(fp)) != EOF) /* read each char */
/* is it a letter ? */
if (('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z'))
putchar (c ^ ('A' ^ 'a')); /* toggle case */
else
putchar (c); /* just output */
if (fp != stdin) fclose (fp); /* close file if not stdin */
return 0;
}
Example Input
$ cat dat/captnjack.txt
This is a tale
Of Captain Jack Sparrow
A Pirate So Brave
On the Seven Seas.
Example Use/Output
$ ./bin/case_toggle < dat/captnjack.txt
tHIS IS A TALE
oF cAPTAIN jACK sPARROW
a pIRATE sO bRAVE
oN THE sEVEN sEAS.
If you want to write the output to a new file, simply redirect the output, e.g.
$ ./bin/case_toggle < dat/captnjack.txt > dat/captnjack_toggled.txt
Which would write the case-toggled output to dat/captnjack_toggled.txt.
Look things over and let me know if you have further questions.
To start off, fputc does not delete anything, nor does it act like an "insert", after you use fseek to go back 1 character.
In this case you are:
1. getting a char
2. using fseek to go back to before that char
3. placing an upper/lowercase char before the char you just read
4. setting the new offset to right after your new character.
You probably get a lot of hEEEEEEEEEEEEEEEEEE in your text file after exiting your endless loop?
To fix this I would create a temporary new file.. something like this:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
int main()
{
FILE *fp, *new_f;
int ch;
long offset;
fp = fopen("test.txt", "r+");
new_f = fopen("test2.txt", "w" );
if ( fp == NULL || new_f == NULL )
{
printf("error: unable to open file\n");
exit(1);
}
offset = ftell(fp);
while (1)
{
ch = fgetc(fp);
if (ch == EOF)
break;
if( !isalpha( ch ) )
{
fputc( ch, new_f );
}
else if (isupper(ch))
{
fputc(tolower(ch), new_f );
}
else if (islower(ch))
{
fputc(toupper(ch), new_f);
}
}
fclose( fp );
fclose( new_f );
return 0;
}
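As an alternative to a second file, the in-place approach from the question can also be made to work. The C standard requires a call to fflush or to a file positioning function (fseek, fsetpos, rewind) between a write and the next read on an update stream; the loop in the question goes from fputc straight to fgetc with only ftell in between, which is a likely cause of the odd offsets. The sketch below is one way to satisfy that rule; the function name is an illustrative assumption.

```c
#include <ctype.h>
#include <stdio.h>

/* Toggle the case of every letter in a stream opened for update ("r+").
 * Returns 0 on success, -1 on an I/O error. */
int toggle_case_in_place(FILE *fp)
{
    for (;;) {
        long pos = ftell(fp);          /* position of the byte about to be read */
        int ch;

        if (pos < 0)
            return -1;
        ch = fgetc(fp);
        if (ch == EOF)
            break;
        if (!isalpha(ch))
            continue;

        /* Go back to the byte just read so the write overwrites it. */
        if (fseek(fp, pos, SEEK_SET) != 0)
            return -1;
        if (fputc(islower(ch) ? toupper(ch) : tolower(ch), fp) == EOF)
            return -1;

        /* Positioning call required between this write and the next read. */
        if (fseek(fp, 0L, SEEK_CUR) != 0)
            return -1;
    }
    return ferror(fp) ? -1 : 0;
}
```

A caller would open the file with fopen(path, "r+"), pass the stream to this function, and close it afterwards.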

Read 8 Bit Binary Numbers from text File in C

I'm trying to read from a text file in C that contains a list of 8 bit binary numbers to be used in another function.
The text file is formatted like:
01101101
10110110
10101101
01001111
11010010
00010111
00101011
Etc. . .
Here's roughly what I was trying to do
Pseudo code
void bincalc(char 8_bit_num){
//does stuff
}
int main()
{
FILE* f = fopen("test.txt", "r");
int n = 0, i = 0;
while( fscanf(f, "%d ", &n) > 0 ) // parse %d followed by a new line or space
{
bincalc(n);
}
fclose(f);
}
I think i'm on the right track, however any help is appreciated.
This is not the standard way to do it, but it is OK. You are scanning the file with %d, so each string of digits is read and interpreted as a decimal number, which you should in turn convert to the corresponding binary value, i.e. converting 1010 decimal to 2^3 + 2^1 = 10 decimal. This is of course possible: you just need to transform powers of ten into powers of two (1*10^3 + 0*10^2 + 1*10^1 + 0*10^0 becomes 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0). Be careful: this works with 8-bit numbers, but not with much longer ones (16 digits will not fit in a typical 32-bit int).
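The conversion described above could be sketched as follows. The function name and the -1 error convention are assumptions made for illustration. It walks the decimal digits of the value read by %d from right to left and gives each one the matching power of two.

```c
/* Reinterpret the decimal digits of n (each must be 0 or 1) as a base-2
 * number. Returns -1 if n is negative or contains any other digit. */
int decimal_digits_as_binary(int n)
{
    int value = 0;
    int weight = 1;                /* 2^0, 2^1, 2^2, ... */

    if (n < 0)
        return -1;
    while (n > 0) {
        int digit = n % 10;        /* rightmost decimal digit */
        if (digit > 1)
            return -1;
        value += digit * weight;
        weight *= 2;
        n /= 10;
    }
    return value;
}
```

With this helper, the 1010 read by %d becomes 10, and the line 01101101 (read as the decimal 1101101) becomes 109.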
The more common way is to read the strings and decode the strings directly, or read char by char and make an incremental conversion.
If you want to have a number at the end, I would suggest something like the following:
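int i;
char buf[10];
unsigned char val;
FILE* f = fopen("test.txt", "r");
/* test fscanf's result instead of feof(), which would process the last line twice */
while( fscanf(f, "%9s", buf) == 1 ) {
    val = 0;
    /* stop early if the line holds fewer than 8 digits */
    for( i=0; i<8 && buf[i] != '\0'; i++ )
        val = (val << 1) | (buf[i] - '0');
    /* val is an 8 bit number now representing the binary digits */
    bincalc(val);
}
fclose(f);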
This is just a short snippet to illustrate the idea and does not catch all corner cases regarding the file or buffer handling. I hope it will help anyway.
I think this is a possible solution; I kept it simple so you can understand it and adapt it to your needs. You only need to repeat this for each line.
#include <stdio.h>
int main(void)
{
    FILE *file = fopen("data", "r");
    unsigned char result = 0;
    int i;
    if (file == NULL)
        return 1;
    /* read the 8 digits of the first line, most significant bit first */
    for (i = 0; i < 8; i++)
    {
        int get = getc(file);
        if (get == '0')
            result <<= 1;
        else if (get == '1')
            result = (result << 1) | 0x1;
    }
    printf("->%c\n", result);
    fclose(file);
    return 0;
}
I don't know of a way to specify binary format for fscanf() but you can convert a binary string like this.
#include <stdio.h>
#include <stdlib.h>
void fatal(char *msg) {
printf("%s\n", msg);
exit (1);
}
unsigned binstr2int(char *str) {
int i = 0;
while (*str != '\0' && *str != '\n') { // string term or newline?
if (*str != '0' && *str != '1') // binary digit?
fatal ("Not a binary digit\n");
i = i * 2 + (*str++ & 1);
}
return i;
}
int main(void) {
unsigned x;
char inp[100];
FILE *fp;
if ((fp = fopen("test.txt", "r")) == NULL)
fatal("Unable to open file");
while (fgets(inp, 99, fp) != NULL) {
x = binstr2int(inp);
printf ("%X\n", x);
}
fclose (fp);
return 0;
}
File input
01101101
10110110
10101101
01001111
11010010
00010111
00101011
Program output
6D
B6
AD
4F
D2
17
2B
Read the file line by line, then use strtol() with base 2.
Or, if the file is guaranteed to be not long, you could also read the full contents in (dynamically allocated) memory and use the endptr parameter.
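A minimal sketch of the strtol() suggestion, assuming each line has already been read with fgets. The helper name and its return convention are illustrative assumptions. The endptr parameter is used to reject lines that contain anything other than binary digits.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse one line of '0'/'1' characters as a base-2 number.
 * Stores the value in *out and returns 0 on success, -1 otherwise. */
int parse_binary_line(const char *line, long *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(line, &end, 2);
    if (end == line || errno == ERANGE)
        return -1;                 /* no digits at all, or out of range */

    /* Allow a trailing line ending or spaces, but nothing else. */
    while (*end == '\r' || *end == '\n' || *end == ' ')
        end++;
    if (*end != '\0')
        return -1;                 /* stray character such as the 2 in "0102" */

    *out = value;
    return 0;
}
```

Called on the line "01101101\n" this stores 109 (0x6D) in *out, matching the first line of the program output shown above.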

Not reading from stdin properly

I'm trying to mimic the behavior of the unix utility cat, but when I call a command of the form:
cat file1 - file2 - file3
My program will output file1 correctly, then read in from stdin, then when I press EOF, it will print file 2 then file 3, without reading from stdin for the second time.
Why might this be?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ASCII_LENGTH 255
int printfile(FILE *source, int N);
int main(int argc, char *argv[])
{
int currentarg = 1; //the argument number currently being processed
FILE *input_file;
//if there are no arguments, dog reads from standard input
if(argc == 1 || currentarg == argc)
{
input_file = stdin;
printfile(input_file,0);
}
else
{
int i;
for(i = currentarg; i < argc; i++)
{
printf("%d %s\n",i,argv[i]);
//if file is a single dash, dog reads from standard input
if(strcmp(argv[i],"-") == 0)
{
input_file = stdin;
printfile(input_file,0);
fflush(stdin);
fclose(stdin);
clearerr(stdin);
}
else if ((input_file = fopen(argv[i], "r")) == NULL)
{
fprintf(stderr, "%s: %s: No such file or directory\n", argv[0], argv[i]);
return 1;
}
else
{
printfile(input_file,0);
fflush(input_file);
fclose(input_file);
clearerr(input_file);
}
}
}
return 0;
}
int printfile(FILE *source, int N)
{
//used to print characters of a file to the screen
//characters can be shifted by some number N (between 0 and 25 inclusive)
char c;
while((c = fgetc(source)) != EOF)
{
fputc((c+N)%ASCII_LENGTH,stdout);
}
printf("***** %c %d",c,c==EOF);
return 0;
}
For one thing, you can't expect to be able to read from stdin after you've closed it:
fclose(stdin);
fflush(stdin); is undefined behaviour, as is fflush on all files open only for input. That's sort of like flushing the toilet and expecting the waste to come out of the bowl, because fflush is only defined for files open for output! I would suggest something like for (int c = fgetc(stdin); c >= 0 && c != '\n'; c = fgetc(stdin)); if you wish to discard the remainder of a line.
Furthermore, fgetc returns int for a reason: inside the int will be an unsigned char value or EOF. c should be an int, not a char. EOF isn't a character! It's a negative int value. This differentiates it from any possible character, because a successful call to fgetc only returns a non-negative value, never the negative EOF. fputc expects its input in the form of an unsigned char value, and char isn't required to be unsigned. Provided your fgetc call is successful and you store the return value in an int, that int is safe to pass on to fputc.
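The points above could be combined into a copy routine along these lines. This is a sketch under assumptions: the function name is made up, and calling clearerr instead of fclose is one possible way to let a terminal-backed stdin be read again for a second "-" argument (it will not help if stdin is a file or pipe that is already exhausted).

```c
#include <stdio.h>

/* Copy everything from src to dst and return the number of bytes copied.
 * The result of fgetc is kept in an int so that EOF stays distinct from
 * every valid byte value, including 0xFF. */
long copy_stream(FILE *src, FILE *dst)
{
    long count = 0;
    int c;

    while ((c = fgetc(src)) != EOF) {
        if (fputc(c, dst) == EOF)
            break;
        count++;
    }

    /* Reset the end-of-file and error indicators rather than closing the
     * stream, so the caller can attempt to read from it again later. */
    clearerr(src);
    return count;
}
```

In the program above, both the "-" branch and the regular file branch could call copy_stream(input_file, stdout), with fclose reserved for streams the program opened itself.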

How to read unicode (utf-8) / binary file line by line

Hi programmers,
I want to read, line by line, a Unicode (UTF-8) text file created by Notepad. I don't want to display the Unicode strings on the screen; I just want to read and compare the strings.
This code reads an ANSI file line by line and compares the strings.
What i want
Read test_ansi.txt line by line
if the line = "b" print "YES!"
else print "NO!"
read_ansi_line_by_line.c
#include <stdio.h>
#include <string.h>
int main()
{
char *inname = "test_ansi.txt";
FILE *infile;
char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
int line_number;
infile = fopen(inname, "r");
if (!infile) {
printf("\nfile '%s' not found\n", inname);
return 0;
}
printf("\n%s\n\n", inname);
line_number = 0;
while (fgets(line_buffer, sizeof(line_buffer), infile)) {
++line_number;
/* note that the newline is in the buffer */
if (strcmp("b\n", line_buffer) == 0 ){
printf("%d: YES!\n", line_number);
}else{
printf("%d: NO!\n", line_number);
}
}
printf("\n\nTotal: %d\n", line_number);
return 0;
}
test_ansi.txt
a
b
c
Compiling
gcc -o read_ansi_line_by_line read_ansi_line_by_line.c
Output
test_ansi.txt
1: NO!
2: YES!
3: NO!
Total: 3
Now I need to read a Unicode (UTF-8) file created by Notepad. After more than 6 months I haven't found any good code or library in C that can read a file encoded in UTF-8. I don't know exactly why, but I think standard C doesn't support Unicode!
Reading a Unicode binary file is OK, but the problem is that the binary file must already have been created in binary mode. That means if we want to read a Unicode (UTF-8) file created by Notepad, we need to translate it from a UTF-8 file to a binary file!
This code writes a Unicode string to a binary file. NOTE: the C file is encoded in UTF-8 and compiled with GCC.
What i want
Write the Unicode char "ب" to test_bin.dat
create_bin.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif
#include <stdio.h>
#include <wchar.h>
int main()
{
/*Data to be stored in file*/
wchar_t line_buffer[BUFSIZ]=L"ب";
/*Opening file for writing in binary mode*/
FILE *infile=fopen("test_bin.dat","wb");
/*Writing data to file*/
fwrite(line_buffer, 1, 13, infile);
/*Closing File*/
fclose(infile);
return 0;
}
Compiling
gcc -o create_bin create_bin.c
Output
create test_bin.dat
Now i want read the binary file line by line and compare!
What i want
Read test_bin.dat line by line
if the line = "ب" print "YES!"
else print "NO!"
read_bin_line_by_line.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif
#include <stdio.h>
#include <wchar.h>
int main()
{
wchar_t *inname = L"test_bin.dat";
FILE *infile;
wchar_t line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
infile = _wfopen(inname,L"rb");
if (!infile) {
wprintf(L"\nfile '%s' not found\n", inname);
return 0;
}
wprintf(L"\n%s\n\n", inname);
/*Reading data from file into temporary buffer*/
while (fread(line_buffer,1,13,infile)) {
/* note that the newline is in the buffer */
if ( wcscmp ( L"ب" , line_buffer ) == 0 ){
wprintf(L"YES!\n");
}else{
wprintf(L"NO!\n", line_buffer);
}
}
/*Closing File*/
fclose(infile);
return 0;
}
Output
test_bin.dat
YES!
THE PROBLEM
This method is VERY LONG and NOT POWERFUL (I'm a beginner in software engineering).
Does anyone know how to read a Unicode file? (I know it's not easy!)
Does anyone know how to convert a Unicode file to a binary file? (a simple method)
Does anyone know how to read a Unicode file in binary mode? (I'm not sure)
Thank You.
A nice property of UTF-8 is that you do not need to decode in order to compare it. The order returned from strcmp will be the same whether you decode it first or not. So just read it as raw bytes and run strcmp.
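Since the comparison works on raw bytes, a line read with fgets can be checked against a UTF-8 string directly. The sketch below assumes a Notepad-style file, so it skips a leading UTF-8 BOM and ignores the trailing "\r\n" or "\n"; the function name is an illustrative assumption.

```c
#include <string.h>

/* Compare one line read with fgets against a UTF-8 string, byte by byte.
 * A leading UTF-8 BOM and a trailing "\r\n" or "\n" are ignored.
 * Returns 1 on a match, 0 otherwise. */
int utf8_line_equals(const char *line, const char *expected)
{
    size_t len;

    if (strncmp(line, "\xEF\xBB\xBF", 3) == 0)
        line += 3;                     /* skip the BOM */
    len = strcspn(line, "\r\n");       /* length without the line ending */
    return len == strlen(expected) && memcmp(line, expected, len) == 0;
}
```

The letter "ب" (U+0628) is the two bytes D8 A8 in UTF-8, so a line holding it can be tested with utf8_line_equals(line_buffer, "\xD8\xA8"), or with the literal "ب" if the source file itself is saved as UTF-8.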
I found a solution to my problem, and I would like to share the solution to any one interested in reading UTF-8 file in C99.
#include <stdio.h>
#include <string.h>
void ReadUTF8(FILE* fp)
{
unsigned char iobuf[255] = {0};
while( fgets((char*)iobuf, sizeof(iobuf), fp) )
{
size_t len = strlen((char *)iobuf);
if(len > 1 && iobuf[len-1] == '\n')
iobuf[len-1] = 0;
len = strlen((char *)iobuf);
printf("(%zu) \"%s\" ", len, iobuf);
if( iobuf[0] == '\n' )
printf("Yes\n");
else
printf("No\n");
}
}
void ReadUTF16BE(FILE* fp)
{
}
void ReadUTF16LE(FILE* fp)
{
}
int main()
{
FILE* fp = fopen("test_utf8.txt", "r");
if( fp != NULL)
{
// see http://en.wikipedia.org/wiki/Byte-order_mark for explaination of the BOM
// encoding
unsigned char b[3] = {0};
fread(b,1,2, fp);
if( b[0] == 0xEF && b[1] == 0xBB)
{
fread(b,1,1,fp); // 0xBF
ReadUTF8(fp);
}
else if( b[0] == 0xFE && b[1] == 0xFF)
{
ReadUTF16BE(fp);
}
else if( b[0] == 0xFF && b[1] == 0xFE)
{
// the UTF-16 little-endian BOM is FF FE (00 00 FE FF would be UTF-32 big-endian)
ReadUTF16LE(fp);
}
else
{
// we don't know what kind of file it is, so assume its standard
// ascii with no BOM encoding
rewind(fp);
ReadUTF8(fp);
}
fclose(fp);
}
}
fgets() can decode UTF-8 encoded files if you use Visual Studio 2005 and up. Change your code like this:
infile = fopen(inname, "r, ccs=UTF-8");
I know I am bad... but you don't even take the BOM into consideration! Most examples here will fail.
EDIT:
Byte Order Marks are a few bytes at the beginning of the file, which can be used to identify the encoding of the file. Some editors add them, and many times they just break things in fabulous ways (I remember fighting a PHP headers problem for several minutes because of this issue).
Some RTFM:
http://en.wikipedia.org/wiki/Byte_order_mark
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
What is XML BOM and how do I detect it?
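As a rough illustration of BOM detection, the first bytes of a file can be compared against the well-known marks. The enum and function names below are assumptions made for this sketch; the byte values are the standard BOMs for UTF-8, UTF-16 and UTF-32.

```c
#include <stddef.h>

enum bom_type {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
};

/* Inspect the first n bytes of a file and report which BOM, if any, they
 * start with. */
enum bom_type detect_bom(const unsigned char *buf, size_t n)
{
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF)
        return BOM_UTF32_BE;
    /* UTF-32 LE must be tested before UTF-16 LE: both start with FF FE. */
    if (n >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return BOM_UTF32_LE;
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A caller would fread up to four bytes from a file opened in binary mode, pass them to detect_bom, and then seek to just past the mark (or back to the start when BOM_NONE is returned) before reading the text.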
In this article a coding and decoding routine is written and
it is explained how the unicode is encoded:
http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/
It can be easily adjusted to C.
Simply encode your ANSI or decode the UTF-8 String and make a byte
compare
EDIT: After the OP said that it is too hard to rewrite the function from C++
here a template:
What is needed:
+ Free the allocated memory (or wait till the process ends or ignore it)
+ Add the 4 byte functions
+ Tell me that short and int is not guaranteed to be 2 and 4 bytes long (I know, but
C is really stupid !) and finally
+ Find some other errors
#include <stdlib.h>
#include <string.h>
#define MASKBITS 0x3F
#define MASKBYTE 0x80
#define MASK2BYTES 0xC0
#define MASK3BYTES 0xE0
#define MASK4BYTES 0xF0
#define MASK5BYTES 0xF8
#define MASK6BYTES 0xFC
char* UTF8Encode2BytesUnicode(unsigned short* input)
{
int size = 0,
cindex = 0;
while (input[size] != 0)
size++;
// Reserve enough space: each 16-bit unit needs at most 3 bytes, plus the terminator
char* result = (char*) malloc(size * 3 + 1);
if (result == NULL)
return NULL;
for (int i=0; i<size; i++)
{
// 0xxxxxxx
if(input[i] < 0x80)
{
result[cindex++] = ((char) input[i]);
}
// 110xxxxx 10xxxxxx
else if(input[i] < 0x800)
{
result[cindex++] = ((char)(MASK2BYTES | input[i] >> 6));
result[cindex++] = ((char)(MASKBYTE | input[i] & MASKBITS));
}
// 1110xxxx 10xxxxxx 10xxxxxx
else
{
result[cindex++] = ((char)(MASK3BYTES | input[i] >> 12));
result[cindex++] = ((char)(MASKBYTE | input[i] >> 6 & MASKBITS));
result[cindex++] = ((char)(MASKBYTE | input[i] & MASKBITS));
}
}
// terminate the string and hand it back to the caller
result[cindex] = '\0';
return result;
}
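wchar_t* UTF8Decode2BytesUnicode(char* input)
{
// work on unsigned bytes so values >= 0x80 do not turn negative
const unsigned char* in = (const unsigned char*) input;
int size = strlen(input);
// at most one wchar_t per input byte, plus the terminator
wchar_t* result = (wchar_t*) malloc((size + 1) * sizeof(wchar_t));
int rindex = 0,
windex = 0;
if (result == NULL)
return NULL;
while (rindex < size)
{
wchar_t ch;
// 1110xxxx 10xxxxxx 10xxxxxx
if((in[rindex] & MASK4BYTES) == MASK3BYTES && rindex + 2 < size)
{
ch = ((in[rindex] & 0x0F) << 12) | (
(in[rindex+1] & MASKBITS) << 6)
| (in[rindex+2] & MASKBITS);
rindex += 3;
}
// 110xxxxx 10xxxxxx
else if((in[rindex] & MASK3BYTES) == MASK2BYTES && rindex + 1 < size)
{
ch = ((in[rindex] & 0x1F) << 6) | (in[rindex+1] & MASKBITS);
rindex += 2;
}
// 0xxxxxxx, or a byte this template does not handle: copy it through
else
{
ch = in[rindex];
rindex += 1;
}
result[windex++] = ch;
}
// terminate the string and hand it back to the caller
result[windex] = 0;
return result;
}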
char* getUnicodeToUTF8(wchar_t* myString) {
int size = sizeof(wchar_t);
if (size == 1)
return (char*) myString;
else if (size == 2)
return UTF8Encode2BytesUnicode((unsigned short*) myString);
else
return UTF8Encode4BytesUnicode((unsigned int*) myString);
}
Just to settle the BOM argument, here is a file from Notepad:
[paul@paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
0000000 ef bb bf 61 0d 0a 62 0d 0a 63
0000012
with a BOM at the start.
Personally I don't think there should be a BOM (since it's a byte format), but that's not the point.

Resources