Reading CJK characters from an input file in C

I have a text file which can contain a mix of Chinese, Japanese, Korean (CJK) and English characters. I have to validate the file for English characters. The file can be allowed to contain CJK characters only when a line begins with the '$' character, which represents a comment in my text file. Searching through the net, I found out that I can use fgetws() and the wchar_t type to read wide chars.
Q1) But I am wondering how CJK characters would be stored in my text file - what byte order etc.
Q2) How can I loop through CJK characters? Since Unicode characters can have 1 to 6 bytes, I cannot use i++.
Any help would be appreciated.
Thanks a lot.

You need to read the UTF-8 file as a sequence of UTF-32 codepoints. For example:
std::shared_ptr<FILE> f(fopen(filename, "r"), fclose);
uint32_t c = 0;
while (utf8_read(f.get(), c))
{
if (is_english_char(c))
...
else if (is_cjk_char(c))
...
else
...
}
Where utf8_read has the signature:
bool utf8_read(FILE *f, uint32_t &c);
Now, utf8_read may read 1-4 bytes depending on the value of the first byte. See http://en.wikipedia.org/wiki/UTF-8, google for an algorithm or use a library function already available to you.
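As a rough illustration of what such a function does, here is a minimal C sketch. The signature above is C++; the name utf8_read_c and the pointer out-parameter are assumptions made for this example, not part of the original answer. It decodes one code point per call and rejects invalid lead or continuation bytes, but it does not check for overlong encodings or surrogates, which a stricter decoder should.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical C counterpart of utf8_read: reads one code point from f.
   Returns 1 on success, 0 on end of file or malformed input. */
int utf8_read_c(FILE *f, uint32_t *out)
{
    assert(f != NULL && out != NULL);

    int b = fgetc(f);
    if (b == EOF)
        return 0;

    int extra;      /* number of continuation bytes still expected */
    uint32_t cp;    /* code point being assembled */
    if (b < 0x80)                { cp = (uint32_t)b;        extra = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = (uint32_t)b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = (uint32_t)b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = (uint32_t)b & 0x07; extra = 3; }
    else
        return 0;   /* stray continuation byte or invalid lead byte */

    while (extra-- > 0) {
        b = fgetc(f);
        if (b == EOF || (b & 0xC0) != 0x80)
            return 0;   /* truncated or malformed sequence */
        cp = (cp << 6) | ((uint32_t)b & 0x3F);
    }
    *out = cp;
    return 1;
}
```

Called in a loop such as while (utf8_read_c(fp, &c)) { ... }, it yields one code point per iteration regardless of how many bytes each character occupies.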
With the UTF-32 codepoint, you can now check ranges. For English, you can check if it is ASCII (c <= 0x7F) or if it is a Latin character (including support for accented characters for imported words from e.g. French). You may also want to exclude non-printable control characters (e.g. 0x01).
For the Latin and/or CJK character checks, you can check if the character is in a given code block (see http://www.unicode.org/Public/UNIDATA/Blocks.txt for the codepoint ranges). This is the simplest approach.
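A block-based check could look roughly like the sketch below. The function names mirror the is_english_char / is_cjk_char placeholders used earlier but are otherwise assumptions, and the table is only a subset of the CJK-related blocks listed in Blocks.txt (the supplementary-plane ideograph extensions, for instance, are left out), so extend it to suit your data.

```c
#include <stddef.h>
#include <stdint.h>

/* Printable ASCII: space (0x20) through tilde (0x7E). */
int is_printable_ascii(uint32_t c)
{
    return c >= 0x20 && c <= 0x7E;
}

/* Returns 1 if c lies in one of a few CJK-related Unicode blocks.
   The list is deliberately incomplete; see Blocks.txt for the rest. */
int is_cjk_char(uint32_t c)
{
    static const struct { uint32_t lo, hi; } blocks[] = {
        { 0x3000, 0x303F },  /* CJK Symbols and Punctuation */
        { 0x3040, 0x309F },  /* Hiragana */
        { 0x30A0, 0x30FF },  /* Katakana */
        { 0x3400, 0x4DBF },  /* CJK Unified Ideographs Extension A */
        { 0x4E00, 0x9FFF },  /* CJK Unified Ideographs */
        { 0xAC00, 0xD7AF },  /* Hangul Syllables */
        { 0xFF00, 0xFFEF },  /* Halfwidth and Fullwidth Forms */
    };
    for (size_t i = 0; i < sizeof blocks / sizeof blocks[0]; i++) {
        if (c >= blocks[i].lo && c <= blocks[i].hi)
            return 1;
    }
    return 0;
}
```

For the validation described in the question, a line that does not start with '$' would be rejected as soon as is_cjk_char returns 1 for any of its code points.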
If you are using a library with Unicode support that has writing script detection (e.g. the glib library), you can use the script type to detect the characters. Alternatively, you can get the data from http://www.unicode.org/Public/UNIDATA/Scripts.txt:
Name : Code : Language(s)
=========:===========:========================================================
Common : Zyyy : general punctuation / symbol characters
Latin : Latn : Latin languages (English, German, French, Spanish, ...)
Han : Hani : Chinese characters (Chinese, Japanese); Hans/Hant are the simplified/traditional variants
Hiragana : Hira : Japanese
Katakana : Kana : Japanese
Hangul : Hang : Korean
NOTE: The script codes come from http://www.iana.org/assignments/language-subtag-registry (Type == 'script').

I am pasting a sample program to illustrate wchar_t handling. Hope it helps someone.
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#define BUFLEN 1024
int main() {
wchar_t *wmessage=L"Lets- beginめん(下) 震災後、保存-食で-脚光-(経済ナビゲーター)-lets- end";
wchar_t warray[BUFLEN + 1];
wchar_t a = L'z';
int i=0;
FILE *fp;
wchar_t *token = L"-";
wchar_t *state;
wchar_t *ptr;
setlocale(LC_ALL, "");
/* File in current directory containing CJK chars */
fp = fopen("input", "r");
if (fp == NULL) {
printf("%s\n", "Cannot open file!!!");
return (-1);
}
if (fgetws(warray, BUFLEN, fp) == NULL) { /* read failed or file is empty */
fclose(fp);
return (-1);
}
wprintf(L"\n *********************START reading from file*******************************\n");
wprintf(L"%ls\n",warray);
wprintf(L"\n*********************END reading from file*******************************\n");
fclose(fp);
wprintf(L"printing character %lc = <0x%x>\n", a, a);
wprintf(L"\n*********************START Checking string for Japanese*******************************\n");
for(i=0;wmessage[i] != '\0';i++) {
if (wmessage[i] > 0x7F) {
wprintf(L"\n This is non-ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
} else {
wprintf(L"\n This is ASCII <0x%x> <%lc>", wmessage[i], wmessage[i]);
}
}
wprintf(L"\n*********************END Checking string for Japanese*******************************\n");
wprintf(L"\n*********************START Tokenizing******************************\n");
state = wcstok(warray, token, &ptr);
while (state != NULL) {
wprintf(L"\n %ls", state);
state = wcstok(NULL, token, &ptr);
}
wprintf(L"\n*********************END Tokenizing******************************\n");
return 0;
}

You need to understand UTF-8 and use some UTF8 handling library (or code your own). FYI, Glib (from GTK) has UTF-8 handling functions, which are able to deal with variable-length UTF-8 chars & strings. There are other UTF-8 libraries e.g. iconv - inside GNU libc - and ICU and many others.
UTF-8 does define the byte order and content of multi-byte UTF8 characters, e.g. Chinese ones.

Related

Unable to print the character 'à' with the printf function in C

I would like to understand why I can print the character 'à' with the functions fopen and fgetc when I read a .txt file but I can't assign it to a char variable and print it with the printf function.
When I read the file texte.txt, the output is:
(Here is a letter that we often use in French: à)
The letter 'à' is correctly read by the fgetc function and assigned to the char c variable
See the code below:
int main() {
FILE *fp;
fp=fopen("texte.txt", "r");
if (fp==NULL) {
printf("erreur fopen");
return 1;
}
char c = fgetc(fp);
while(c != EOF) {
printf("%c", c);
c = fgetc(fp);
}
printf("\n");
return 0;
}
But now if I try to assign the 'à' character to a char variable, I get an error!
See the code below:
int main() {
char myChar = 'à';
printf("myChar is: %c\n", myChar);
return 0;
}
ERROR:
./main.c:26:15: error: character too large for enclosing character literal type
char myChar = 'à';
My knowledge in C is very insufficient, and I can't find an answer anywhere
To print à you can use wide character (or wide string):
#include <wchar.h> // wchar_t
#include <stdio.h>
#include <locale.h> // setlocale LC_ALL
int main() {
setlocale(LC_ALL, "");
wchar_t a = L'à';
printf("%lc\n", a);
}
In short: characters have an encoding. The program's locale determines which encoding the standard library functions use. A wide character is a locale-agnostic representation of a character, one that can be converted to and from the encoding of any locale. setlocale(LC_ALL, "") sets your program's locale to the one configured in your environment, which normally matches your terminal. This is needed so that printf knows how to convert the wide character à to the encoding of your terminal. An L in front of a character or string literal makes it wide. On Linux, wide characters are in UTF-32.
Handling encodings might be hard. I can point to: https://en.wikipedia.org/wiki/Character_encoding , https://en.wikipedia.org/wiki/ASCII , https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html , https://en.cppreference.com/w/cpp/locale , https://en.wikipedia.org/wiki/Unicode .
You can also put a multibyte string straight into your source code and output it. This will work only if your compiler generates the multibyte string in the same encoding that your terminal works with. If you change your terminal encoding, or tell your compiler to use a different encoding, it may fail. On Linux, UTF-8 is everywhere: compilers generate UTF-8 strings and terminals understand UTF-8.
const char *str = "à";
printf("%s\n", str);

Why can't I print the decimal value of an extended ASCII char like 'Ç' in C?

First, in this C project we have some conditions as far as writing code: I can't declare a variable and assign a value to it on the same line of code, and we are only allowed to use while loops. Also, I'm using Ubuntu, for reference.
I want to print the decimal ASCII value, character by character, of a string passed to the program. For example, if the input is "rose", the program correctly prints 114 111 115 101. But when I try to print the decimal value of a char like 'Ç', the first char of the extended ASCII table, the program weirdly prints -61 -121. Here is the code:
int main (int argc, char **argv)
{
int i;
i = 0;
if (argc == 2)
{
while (argv[1][i] != '\0')
{
printf ("%i ", argv[1][i]);
i++;
}
}
}
I did some research and found that I should try unsigned char argv instead of char, like this:
int main (int argc, unsigned char **argv)
{
int i;
i = 0;
if (argc == 2)
{
while (argv[1][i] != '\0')
{
printf("%i ", argv[1][i]);
i++;
}
}
}
In this case, I run the program with a 'Ç' and the output is 195 135 (still wrong).
How can I make this program print the right ASCII decimal value of a char from the extended ASCII table? In this case a "Ç" should be 128.
Thank you!!
Your platform is using UTF-8 Encoding.
Unicode Latin Capital Letter C with Cedilla (U+00C7) "Ç" encodes to 0xC3 0x87 in UTF-8.
In turn those bytes in decimal are 195 and 135 which you see in output.
Remember UTF-8 is a multi-byte encoding for characters outside basic ASCII (0 thru 127).
That character is code 128 in some "extended ASCII" code pages (for example CP437), but UTF-8 diverges from extended ASCII in that range.
You may find there are tools on your platform to convert that to extended ASCII, but I suspect you don't want to do that and should work with the encoding supported by your platform (which I am sure is UTF-8).
It's Unicode Code Point 199 so unless you have a specific application for Extended ASCII you'll probably just make things worse by converting to it. That's not least because it's a much smaller set of characters than Unicode.
Here's some information for Unicode Latin Capital Letter C with Cedilla including the UTF-8 Encoding: https://www.fileformat.info/info/unicode/char/00C7/index.htm
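To see how the two bytes 195 and 135 relate to code point 199, the arithmetic can be written out as a small C sketch. The helper name decode_utf8_2 is made up for this illustration; it handles only the two-byte form (110xxxxx 10xxxxxx) and asserts that its inputs have that shape.

```c
#include <assert.h>

/* Decodes a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its code point:
   five payload bits come from the lead byte, six from the continuation byte. */
unsigned decode_utf8_2(unsigned char lead, unsigned char cont)
{
    assert((lead & 0xE0) == 0xC0 && (cont & 0xC0) == 0x80);
    return ((unsigned)(lead & 0x1F) << 6) | (unsigned)(cont & 0x3F);
}
```

For Ç, the lead byte 0xC3 contributes the bits 00011 and the continuation byte 0x87 contributes 000111, which together give 0xC7, that is 199.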
There are various ways of representing non-ASCII characters, such as Ç. Your question suggests you're familiar with 8-bit character sets such as ISO-8859, where in several of its variants Ç does indeed have code 199. (That is, if your computer were set up to use ISO-8859, your program probably would have worked, although it might have printed -57 instead of 199.)
But these days, more and more systems use Unicode, which they typically encode using a particular multibyte encoding, UTF-8.
In C, one way to extract wide characters from a multibyte character string is the function mbtowc. Here is a modification of your program, using this function:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>
int main (int argc, char **argv)
{
setlocale(LC_CTYPE, "");
if (argc == 2)
{
char *p = argv[1];
int n;
wchar_t wc;
while((n = mbtowc(&wc, p, strlen(p))) > 0)
{
printf ("%lc: %d (%d)\n", wc, wc, n);
p += n;
}
}
}
You give mbtowc a pointer to the multibyte encoding of one or more multibyte characters, and it converts one of them, returning it via its first argument (here, into the variable wc). It returns the number of bytes it consumed, 0 if it encountered the end of the string, or -1 if the bytes do not form a valid character.
When I run this program on the string abÇd, it prints
a: 97 (1)
b: 98 (1)
Ç: 199 (2)
d: 100 (1)
This shows that in Unicode (just like 8859-1), Ç has the code 199, but it takes two bytes to encode it.
Under Linux, at least, the C library supports potentially multiple multibyte encodings, not just UTF-8. It decides which encoding to use based on the current "locale", which is usually part of the environment, literally governed by an environment variable such as $LANG. That's what the call setlocale(LC_CTYPE, "") is for: it tells the C library to pay attention to the environment to select a locale for the program's functions, like mbtowc, to use.
Unicode is of course huge, encoding thousands and thousands of characters. Here's the output of the modified version of your program on the string "abΣ∫😊":
a: 97 (1)
b: 98 (1)
Σ: 931 (2)
∫: 8747 (3)
😊: 128522 (4)
Emoji like 😊 typically take four bytes to encode in UTF-8.
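The same loop can be written with mbrtowc, the restartable relative of mbtowc, which keeps its conversion state in an explicit mbstate_t and distinguishes invalid from incomplete input. The sketch below is only an illustration: count_mb_chars is a made-up name, and it assumes the caller has already selected a locale with setlocale, as in the program above.

```c
#include <string.h>
#include <wchar.h>

/* Counts the characters in the multibyte string s under the current
   LC_CTYPE locale. Returns -1 if s contains an invalid or incomplete
   multibyte sequence. */
long count_mb_chars(const char *s)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    size_t left = strlen(s);
    long count = 0;
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                 /* invalid or truncated sequence */
        if (n == 0)
            break;                     /* decoded a null character */
        s += n;                        /* n is the number of bytes consumed */
        left -= n;
        count++;
    }
    return count;
}
```

In a UTF-8 locale this should report 4 for the string abÇd even though strlen reports 5.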

wcstombs doesn't work properly

I have an utf-8 file which I can process normally with widechar functions.
However now I need to convert and use them in multibyte form and I'm struggling to make it work.
printf("%s\n",setlocale(LC_CTYPE, "")); //English_United States.1252
_setmbcp(_MB_CP_LOCALE);
FILE *f = NULL;
f = _wfopen(L"data.txt", L"r,ccs=UTF-8");
wchar_t x[256];
fwscanf(f, L"%ls", x); //x = L"một"
char mb[256];
int l = wcstombs(mb, x, 256); //mb = "m?t"
What did I do wrong?
In your textfile you have the character ộ (note the point below the character) instead of ô.
The character ô exists in codepage 1252, but the character ộ doesn't, and therefore wcstombs transforms it into a ?.
You will have the same problem if your UTF-8 encoded text file contains for example cyrillic or greek characters.
As long as the target encoding is codepage 1252, the only solution is to avoid characters that have no representation in that codepage.

Is it actually possible to store and process individual UTF-8 characters on C ? If so, how?

I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too.
I'm having massive problems saving and performing functions on individual characters. My editor and console are both set up to UTF-8 and can display Arabic text fine if I save it as a char*, but when I try to print wchars they display random punctuation marks.
My program needs to be able to recognise an individual UTF-8 character in order to work. For example, for the word 'though' it stores 't' as syllable[1]segment[1]letter[1], h as syllable[1]segment[1]letter[2] etc. I want to be able to do the same for non-ASCII characters.
I've spent basically the whole day researching unicode and trying out different methods and I can't get any of them to let me store an Arabic character as a character.
I'm not sure if I've just made some stupid syntax errors along the way, if I've completely misunderstood the whole concept, or if it actually just isn't possible to do what I want in C and I should just give up and try another language...
I would massively, massively, massively appreciate any help you can offer! I'm pretty new to programming, but unicode is completely instrumental to my work so I want to work out how to do it from the beginning.
My understanding of how unicode works (in case that's where I'm going wrong):
I type some text into my editor. My editor encodes it according to the encoding I have set. So if I set it to UTF-8 it will encode the Arabic letter ب with the 2 byte sequence 0xd8 0xa8 which indicates the code point U+0628.
I compile it, breaking down 0xd8 0xa8 into the binary 11011000 10101000.
I run it on the command prompt. The command prompt interprets the text according to the encoding I have set, so if I set it to UTF-8 it should interpret 11011000 10101000 as the code point U+0628. Unicode algorithms also tell it which version of U+0628 to display to me, as the character has different shapes depending on where it is in the word. As the character is alone it will show me the standalone version ب
My understanding of the ways I can process unicode in C:
Option A - Use single bytes encoded as UTF-8 (http://www.nubaria.com/en/blog/?p=289)
Use single bytes encoded as UTF-8. Leave all my datatypes as chars and char arrays and only type ASCII characters in my code. If I absolutely have to hard code a unicode character enter it as an array in the format:
const char kChineseSampleText[] = "\xe4\xb8\xad\xe6\x96\x87";
My problems with this:
I need to manipulate individual characters
Having to type Arabic characters as code points is going to render my code completely unreadable and slow me down immensely.
Option B - Use wchar and friends (http://icu-project.org/docs/papers/unicode_wchar_t.html)
Swap using chars for wchars, which hold 2 to 4 bytes depending on the compiler. String functions like strlen will not work as they are expecting characters to be one byte, but there are w functions like wprintf I can use instead.
My problem with this:
I can’t get wchars to print Arabic characters at all! I can get them to print English letters fine, but Arabic characters just pull through as random punctuation marks.
I've tried inputing the unicode code point as well as the actual Arabic character and I've tried printing them both to the console and to a UTF-8 encoded text file and I get the same result, even though both the console and the text file display Arabic text if entered as a char*. I've included my code at the end.
(It’s worth saying here that I am aware that a lot of people think wchars are bad because they aren’t very portable and because they take up extra space for ASCII characters. But at this stage, neither of those things are really a worry for me - I’m just writing the program to run on my own computer and the program will only be processing short strings.)
Option C - Use external libraries
I've read in various comments that external libraries are the way to go so I've tried:
C programming library
http://www.cprogramming.com/tutorial/unicode.html suggests replacing all chars with unsigned long integers and using special functions for iterating through strings etc. The site even provides a sample library to download.
My problem:
While I can set the character to be an unsigned long integer I can’t print it out, because the printf and wprintf functions don’t work, and neither does the library provided on the website (I think maybe the library was designed for Linux? Some of the datatypes are invalid and amending them didn't work either)
ICU library
My problem:
I downloaded the ICU library, but when I was looking into how to use it I saw that functionality such as the characterIterator is not available for use in C (http://userguide.icu-project.org/strings). Being able to iterate through characters is completely fundamental to what I need to do, so I don't think the library will work for me.
My code
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
#include <string.h>
int main ()
{
wchar_t unicode = L'\xd8ac';
wchar_t arabic = L'ب';
wchar_t number = 0x062c;
FILE* f;
f = fopen("unitest.txt","w");
char* string = "ايه الاخبار";
//printf - works
printf("printf - literal arabic character is \"م\"\n");
fprintf(f,"printf - literal arabic character is \"م\"\n");
printf("printf - char* string is \"%s\"\n",string);
fprintf(f,"printf - char* string is \"%s\"\n",string);
//wprintf - english - works
wprintf(L"wprintf - literal english char is \"%C\"\n\n", L't');
fwprintf(f,L"wprintf - literal english char is \"%C\"\n\n", L't');
//wprintf - arabic - doesnt work
wprintf(L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
fwprintf(f,L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
wprintf(L"wprintf - unicode number wchar_t is \"%C\"\n", number);
fwprintf(f,L"wprintf - unicode number wchar_t is \"%C\"\n", number);
wprintf(L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
fwprintf(f,L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
wprintf(L"wprintf - literal arabic character is \"%C\"\n",L'ت');
fwprintf(f,L"wprintf - literal arabic character is \"%C\"\n",L'ت');
wprintf(L"wprintf - literal arabic character in string is \"م\"\n\n");
fwprintf(f,L"wprintf - literal arabic character in string is \"م\"\n\n");
fclose(f);
return 0;
}
Output file
printf - literal arabic character is "م"
printf - char* string is "ايه الاخبار"
wprintf - literal english char is "t"
wprintf - unicode wchar_t is "�"
wprintf - unicode number wchar_t is ","
wprintf - arabic wchar_t is "("
wprintf - literal arabic character is "*"
wprintf - literal arabic character in string is ""
I'm using Windows 10, Notepad++ and MinGW.
Edit
This got marked as a duplicate of Light C Unicode Library but I don't think it really answers my question. I've downloaded the library and had a look at it, and you can call me stupid if you like, but I'm really new to programming and I don't understand most of the code in the library, so it's hard for me to work out how I can use it to achieve what I want. I searched the library for a print function and couldn't find one...
I just want to save a UTF-8 character and then print it out again! Do I really need to install an entire library to do that? I would just really appreciate someone taking pity on me and telling me in baby terms how I can do it... People keep saying I should use uint_32 or something instead of wchar - but how do I then print those datatypes? Can I do it with wprintf?!
C and UTF-8 are still getting to know each other. In other words, IMO, C support for UTF-8 is scant.
Is it ... possible to store and process individual UTF-8 characters ...?
The first step is to make certain "ايه الاخبار" is a UTF-8 encoded string. C11 supports this explicitly with u8"ايه الاخبار".
A UTF-8 string is a sequence of char. Each group of 1 to 4 char represents one Unicode character. Storing any Unicode code point takes a type of at least 21 bits. Yet OP does not need to convert a portion of string[] into a Unicode character so much as segment that string on UTF-8 boundaries. These boundaries are readily found by looking for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
The following code copies each Unicode character, still encoded as UTF-8, into a short string with a terminating null character. Then that short string is printed.
char* string = u8"ايه الاخبار";
for (char *s = string; *s; ) {
printf("<");
char u[5];
char *p = u;
*p++ = *s++;
if ((*s & 0xC0) == 0x80) *p++ = *s++;
if ((*s & 0xC0) == 0x80) *p++ = *s++;
if ((*s & 0xC0) == 0x80) *p++ = *s++;
*p = 0;
printf("%s", u);
printf(">\n");
}
With the output viewed with a UTF8 aware screen:
<ا>
<ي>
<ه>
< >
<ا>
<ل>
<ا>
<خ>
<ب>
<ا>
<ر>
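The same continuation-byte test also gives a simple way to count characters rather than bytes. This is a small sketch under the assumption that the input is valid UTF-8; utf8_count is a name chosen for the example.

```c
#include <stddef.h>

/* Counts code points in a null-terminated UTF-8 string by counting every
   byte that is NOT a continuation byte (10xxxxxx). Assumes valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Note that this counts code points, not what a reader perceives as letters: a base letter followed by a combining mark counts as two.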
An example with utf8proc library to iterate is:
#include <utf8proc.h>
#include <stdio.h>
int main(void) {
utf8proc_uint8_t const string[] = u8"ايه الاخبار";
utf8proc_ssize_t size = sizeof string / sizeof *string - 1;
utf8proc_int32_t data;
utf8proc_ssize_t n;
utf8proc_uint8_t const *pstring = string;
while ((n = utf8proc_iterate(pstring, size, &data)) > 0) {
printf("<%.*s>\n", (int)n, pstring);
pstring += n;
size -= n;
}
}
This is probably not the best way to use this library, but I have opened an issue on GitHub asking for examples, because I was unable to work out how the library is meant to be used.
You need to very clearly understand the difference between a Unicode code point and UTF-8. UTF-8 is a variable-length byte encoding of Unicode code points. The lower end, values 0-127, is stored as a single byte. That's the main point of UTF-8, and makes it backwards compatible with ASCII.
When bit 7 is set, for values over 127, a variable length code of two bytes or more is used. The leading byte always has the bit pattern 11xxxxxx.
Here's code to get the skip (the number of bytes used by a character), and also to read a codepoint and to write one.
static const unsigned int offsetsFromUTF8[6] =
{
0x00000000UL, 0x00003080UL, 0x000E2080UL,
0x03C82080UL, 0xFA082080UL, 0x82082080UL
};
static const unsigned char trailingBytesForUTF8[256] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
int bbx_utf8_skip(const char *utf8)
{
return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}
int bbx_utf8_getch(const char *utf8)
{
int ch;
int nb;
nb = trailingBytesForUTF8[(unsigned char)*utf8];
ch = 0;
switch (nb)
{
/* these fall through deliberately */
case 3: ch += (unsigned char)*utf8++; ch <<= 6;
case 2: ch += (unsigned char)*utf8++; ch <<= 6;
case 1: ch += (unsigned char)*utf8++; ch <<= 6;
case 0: ch += (unsigned char)*utf8++;
}
ch -= offsetsFromUTF8[nb];
return ch;
}
int bbx_utf8_putch(char *out, int ch)
{
char *dest = out;
if (ch < 0x80)
{
*dest++ = (char)ch;
}
else if (ch < 0x800)
{
*dest++ = (ch>>6) | 0xC0;
*dest++ = (ch & 0x3F) | 0x80;
}
else if (ch < 0x10000)
{
*dest++ = (ch>>12) | 0xE0;
*dest++ = ((ch>>6) & 0x3F) | 0x80;
*dest++ = (ch & 0x3F) | 0x80;
}
else if (ch < 0x110000)
{
*dest++ = (ch>>18) | 0xF0;
*dest++ = ((ch>>12) & 0x3F) | 0x80;
*dest++ = ((ch>>6) & 0x3F) | 0x80;
*dest++ = (ch & 0x3F) | 0x80;
}
else
return 0;
return dest - out;
}
Using these functions or similar, you convert between code points and UTF-8 and back.
Windows currently uses UTF-16 for its APIs. To a first approximation, UTF-16 is the code points in 16 bit format. So when writing a UTF-8 based program, you need to convert the UTF-8 to UTF-16 (using wide chars) immediately before calling Windows output functions.
On Windows, support for UTF-8 via printf() is patchy. Passing a UTF-8 encoded string to printf() is unlikely to do what you want unless the console has been set up for UTF-8.
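The "first approximation" breaks down for code points above U+FFFF, which UTF-16 stores as a pair of 16-bit units (a surrogate pair). A minimal sketch of the code point to UTF-16 step is shown below; utf16_encode is a name invented for this example and is not a Windows API.

```c
#include <stdint.h>

/* Encodes one code point as UTF-16. Writes 1 or 2 units into out and
   returns how many were written, or 0 if cp is not a valid scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* stored as-is in one unit */
        return 1;
    }
    cp -= 0x10000;                      /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

The Arabic letter U+0628 therefore occupies one unit (0x0628), while an emoji such as U+1F60A becomes the pair 0xD83D 0xDE0A.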

How to change multicharacter signs by other ones in C?

I've got a UTF-8 text file containing several signs that I'd like to replace with other ones (only those between |( and |) ), but the problem is that some of these signs are not considered as characters but as multi-character signs. (By this I mean they can't be written as a character literal like '∞' but only as a string like "∞", so char * ?)
Here is my textfile :
Text : |(abc∞∪v=|)
For example :
∞ should be changed by ¤c
∪ by ¸!
= changed by "
So as some signs (∞ and ∪) are multi-character, I decided to use fscanf to get all the text word by word. The problem with this method is that I have to put a space between each character ... My file should look like this :
Text : |( a b c ∞ ∪ v = |)
fgetc can't be used because characters like ∞ can't be considered as one single character. If I use it I won't be able to strcmp a char with each sign (char *); I tried to convert my char to char * but strcmp != 0.
Here is my code in C to help you understanding my problem :
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main(void){
char *carac[]={"∞","=","∪"}; //array with our signs
FILE *flot,*flot3;
flot=fopen("fichierdeTest2.txt","r"); // input text file
flot3=fopen("resultat.txt","w"); //output file
int i=0,j=0;
char a[1024]; //array that will contain each read word.
while(!feof(flot))
{
fscanf(flot,"%s",&a[i]);
if (strstr(&a[i], "|(") != NULL){ // if the word read contains |( then j=1
j=1;
fprintf(flot3,"|(");
}
if (strcmp(&a[i], "|)") == 0)
j=0;
if(j==1) { //it means we are between |( and |) so the conversion can begin
if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3,"\"" ); }
else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
else fprintf(flot3,"%s",&a[i]); // when it's a letter, number or sign that doesn't need to be converted
}
else { // when we are not between |( and |) just copy the word to the output file with a space after it
fprintf(flot3, "%s", &a[i]);
fprintf(flot3, " ");
}
i++;
}
}
Thanks a lot for the future help !
EDIT : Every sign will be changed correctly if I put a space between each of them, but without spaces it won't work; that's what I'm trying to solve.
First of all, get the terminology right. Proper terminology is a bit confusing, but at least other people will understand what you are talking about.
In C, char is the same as byte. However, a character is something abstract like ∞ or ¤ or c. One character may contain a few bytes (that is a few chars). Such characters are called multi-byte ones.
Converting a character to a sequence of bytes (encoding) is not trivial. Different systems do it differently; some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little-endian, an 8-bit codepage or any other encoding.
When your C program has something inside quotes, like "∞" - it's a C-string, that is, several bytes terminated by a zero byte. When your code uses strcmp to compare strings, it compares each byte of both strings, to make sure they are equal. So, if your source code and your input file use different encodings, the strings (byte sequences) won't match, even though you will see the same character when examining them!
So, to rule out any encoding mismatches, you might want to use a sequence of bytes instead of a character in your source code. For example, if you know that your input file uses the UTF-8 encoding:
char *carac[]={
"\xe2\x88\x9e", // ∞
"=",
"\xe2\x88\xaa"}; // ∪
Alternatively, make sure the encodings (of your source code and your program's input file) are the same.
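When it is unclear whether two strings that look the same really hold the same bytes, dumping them in hex is a quick check. The helper below is a sketch written for this purpose (hex_bytes is a hypothetical name); it formats each byte of a string as two uppercase hex digits.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes the bytes of s into out as uppercase hex separated by spaces,
   stopping early if out (of size outsz) would overflow. Returns out. */
char *hex_bytes(const char *s, char *out, size_t outsz)
{
    size_t used = 0;
    if (outsz > 0)
        out[0] = '\0';
    /* Each step needs at most 3 characters plus the terminating null. */
    for (; *s != '\0' && used + 4 <= outsz; s++) {
        int n = snprintf(out + used, outsz - used, "%s%02X",
                         used ? " " : "", (unsigned)(unsigned char)*s);
        used += (size_t)n;
    }
    return out;
}
```

If the "∞" literal in your compiled program and the ∞ read from the file both dump as E2 88 9E, the encodings agree; if the literal dumps as something else, the source file was saved or compiled with a different encoding.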
Another, less subtle, problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small string. Here strcmp does the wrong thing! You must use strncmp here instead:
if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
fprintf(flot3, "\xC2\xA4""c"); // ¤c
}
Another problem (actually, a major bug): the fscanf function reads a word (text delimited by spaces) from the input file. If you only examine the first byte in this word, the other bytes will not be processed. To fix, make a loop over all bytes:
fscanf(flot, "%s", a);
for (i = 0; a[i] != '\0'; )
{
    if (strncmp(&a[i], "|(", 2) == 0) // start pattern (strncmp returns 0 on a match)
    {
        now_replacing = 1;
        i += 2;
        continue;
    }
    if (now_replacing && strncmp(&a[i], whatever, strlen(whatever)) == 0)
    {
        fprintf(...);
        i += strlen(whatever);
    }
    else
    {
        // no replacement applies here: copy the byte through unchanged
        fputc(a[i], output);
        i += 1; // processed just one char
    }
}
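Putting these pieces together, a complete version of the replacement loop might look like the sketch below. It works on an in-memory string instead of a file, takes the three replacements from the question, and assumes both input and output are UTF-8; the names sign_map and replace_signs are invented for the example.

```c
#include <stddef.h>
#include <string.h>

struct sign_map { const char *from; const char *to; };

/* Replacements from the question, spelled as UTF-8 byte sequences. */
static const struct sign_map k_signs[] = {
    { "\xE2\x88\x9E", "\xC2\xA4" "c" },  /* ∞ -> ¤c */
    { "\xE2\x88\xAA", "\xC2\xB8" "!" },  /* ∪ -> ¸! */
    { "=",            "\"" },            /* = -> "  */
};

/* Copies in to out, replacing signs only between "|(" and "|)".
   Returns 0 on success, -1 if out (of size outsz) is too small. */
int replace_signs(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    int replacing = 0;
    if (outsz == 0)
        return -1;
    while (*in != '\0') {
        const char *piece = in;   /* default: copy one byte unchanged */
        size_t piece_len = 1, step = 1;
        if (strncmp(in, "|(", 2) == 0) {
            replacing = 1;
            piece_len = step = 2;
        } else if (strncmp(in, "|)", 2) == 0) {
            replacing = 0;
            piece_len = step = 2;
        } else if (replacing) {
            for (size_t k = 0; k < sizeof k_signs / sizeof k_signs[0]; k++) {
                size_t flen = strlen(k_signs[k].from);
                if (strncmp(in, k_signs[k].from, flen) == 0) {
                    piece = k_signs[k].to;
                    piece_len = strlen(piece);
                    step = flen;
                    break;
                }
            }
        }
        if (o + piece_len >= outsz)
            return -1;            /* keep room for the terminating null */
        memcpy(out + o, piece, piece_len);
        o += piece_len;
        in += step;
    }
    out[o] = '\0';
    return 0;
}
```

Because the scan moves byte by byte and uses strncmp against each sign, the input no longer needs spaces between characters.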
You're on the right track, but you need to look at characters differently than strings.
strcmp(carac[0], &a[i])
(Pretending i = 2) As you know this compares the string "∞" with &a[2]. But you forget that &a[2] is the address of the third character of the string (index 2), and strcmp works by scanning the entire string until it hits a null terminator. So "∞" actually ends up getting compared with "abc∞∪v=|)" because a is only null terminated at the very end.
What you could do, if your data is UTF-16 encoded, is not use strings but read each character as a short (16 bits) instead of a char (8 bits). Then you can compare them with your UTF-16 characters:
if (8734 == *((short *)&a[i])) { /* character is infinity */ }
The reason for that 8734 is that it is the UTF-16 value of infinity (0x221E). In a UTF-8 file, as in the question, ∞ is instead stored as the three bytes 0xE2 0x88 0x9E, so this comparison will not match there.
VERY IMPORTANT NOTE:
Whether your machine is big-endian or little-endian matters in this case. If 8734 (0x221E) does not work, give 7714 (0x1E22) a try.
Edit: Something else I overlooked is that you're scanning the entire string at once. The documentation for %s says: "String of characters. This will read subsequent characters until a whitespace is found (whitespace characters are considered to be blank, newline and tab)."
//feof = false.
fscanf(flot,"%s",&a[i]);
//feof = true.
That means you never actually iterate. You need to go back and rethink your scanning procedure.
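One common way to restructure such a loop is to test the return value of fscanf itself rather than feof, so the loop stops as soon as no further word can be read. The sketch below only counts words to keep it short; count_words is a hypothetical helper, and the width limit in the format guards the buffer.

```c
#include <stdio.h>

/* Reads whitespace-delimited words from f until fscanf can no longer
   convert one, and returns how many were read. */
int count_words(FILE *f)
{
    char word[1024];
    int n = 0;
    /* %1023s leaves room for the terminating null in word[1024]. */
    while (fscanf(f, "%1023s", word) == 1)
        n++;
    return n;
}
```

In the real program the body of the while loop would process word instead of just counting it.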
