Printing every character of a UTF8 string - c

I am new to Unicode/UTF8 representations of strings. I am trying to read a UTF8 encoded file, separate it with spaces and then print every character/code-point in every word (separated by spaces).
I was able to use wchar_t (I know it uses utf16 or utf32(?) internally) for reading text from the file, printing it and writing it to another file. However, I was unable to use the wchar_t to get either a substring or traverse it element by element.
To solve this, I used the ICU library from IBM. Code:
while (fgetws(readString, 1000, wifile) != NULL) {
wprintf(L"String: %s\n", readString);
//split string on the base of spaces.
wchar_t *nextToken = NULL;
wchar_t *token = wcstok_s(readString, L" ", &nextToken);
UChar *utf8Token = (UChar *)token;
u_printf("Token in UChar: %S\n", utf8Token);
while (token != NULL) {
printf("Hello.\n");
fwprintf(wofileString, L"%ls and length: %d\n", token, wcslen(token));
fwprintf(wofileString, L"UTF8 rep:%s and length: %d\n", utf8Token, u_strlen(utf8Token));
int32_t counter = 0;
for (counter = 0; counter < u_strlen(utf8Token);) {
UChar32 ch;
U8_NEXT(utf8Token, counter, u_strlen(utf8Token), ch);
fwprintf(wofileString, L"Token[%d] = ", counter);
if (ch < 127) {
printf("Less than 127.\n");
if (ch > 1) {
printf("Printing%d.\n", ch);
u_fprintf((UFILE *)wofileString, "%c\n", (UChar)ch);
}
} else if (ch == CharacterIterator::DONE) {
printf("Done.\n");
u_fprintf((UFILE *)wofileString, "[CharacterIterator::DONE]\n");
} else {
printf("More than 127.\n");
u_fprintf((UFILE *)wofileString, "[%X]\n", ch);
}
}
token = wcstok_s(NULL, L" ", &nextToken);
utf8Token = (UChar *)token;
counter = 0;
}
fputws(L"Complete String: ", wofileString);
fputws(readString, wofileString);
fputws(L"\n", wofileString);
}
This program always stops working when it gets to the part where the characters are printed.
My questions:
1. How can I print all the 'characters' in the input UTF8 string?
2. Is the conversion: UChar *utf8Token = (UChar *) token; even correct? Given that the internal representation for token is UTF16 or UTF32?
3. Where am I going wrong?
4. How do I get a substring of the string?

fwprintf(wofileString,…
u_fprintf((UFILE *)wofileString,…
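One of these two lines is wrong, depending on what wofileString actually is: fwprintf expects a FILE *, while u_fprintf expects a UFILE * (obtained from u_fopen or u_finit), and a cast does not turn one into the other.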
I'd recommend just using the u_… functions.
In fact, I'd just use u_printf("string", …) or u_printf_u(L"String", …) instead of fwprintf or fputws.

Related

C Programming - Space Character Not Detected

Mainly a Java/Python coder here. I am coding a tokenizer for an assignment. (I explicitly cannot use strtok().) The code below is meant to separate the file text into lexemes (aka words and notable characters).
char inText[256];
fgets(inText, 256, inf);
char lexemes[256][256];
int x = 0;
char string[256] = "\0";
for(int i=0; inText[i] != '\0'; i++)
{
char delims[] = " (){}";
char token = inText[i];
if(strstr(delims, &inText[i]) != NULL)
{
if(inText[i] == ' ') // <-- Problem Code
{
if(strlen(string) > 0)
{
strcpy(lexemes[x], string);
x++;
strcpy(string, "\0");
(*numLex)++;
}
}
else if(inText[i] == '(')
{
if(strlen(string) > 0)
{
strcpy(lexemes[x], string);
x++;
strcpy(string, "\0");
(*numLex)++;
}
strcpy(lexemes[x], &token);
x++;
(*numLex)++;
}
else
{
strcpy(lexemes[x], &token);
x++;
(*numLex)++;
}
}
else
{
strcat(string, (char[2]){token});
}
}
For some odd reason, my code cannot recognize the space character as ' ', as 32, or by using isspace(). There are no error messages, and I have confirmed that the code is reaching the space in the text.
This is driving me insane. Does anyone have any idea what is happening here?
You are using the function strstr incorrectly.
if(strstr(delims, &inText[i]) != NULL)
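strstr searches the string " (){}" for the whole string pointed to by &inText[i] (everything from position i to the end of the line), not for the single character inText[i]. It therefore matches only when the entire remainder of the line occurs inside the delimiter string.
Instead you need to use another function, strcspn, which returns the number of leading characters that are not delimiters.
Something like
i += strcspn( &inText[i], delims );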
or you can introduce another variable like for example
size_t n = strcspn( &inText[i], delims );
depending on the logic of the processing you are going to follow.
Or more probably you need to use the function strchr like
if(strchr( delims, inText[i]) != NULL)

C - Program not detecting blank spaces

I want my program to read a file containing words separated by blank spaces and then print the words one by one. This is what I did:
char *phrase = (char *)malloc(LONGMAX * sizeof(char));
char *mot = (char *)malloc(TAILLE * sizeof(char));
FILE *fp = NULL;
fp = fopen("mots.txt", "r");
if (fp == NULL) {
printf("err ");
} else {
fgets(phrase, LONGMAX, fp);
while (phrase[i] != '\0') {
if (phrase[i] != " ") {
mot[m] = phrase[i];
i++;
m++;
} else {
printf("%s\n", phrase[i]);
mot = "";
}
}
}
but it isn't printing anything! Am I doing something wrong? Thanks!
The i in the following:
while (phrase[i]!='\0'){
Should be initialized to 0 before being used, then incremented as you iterate through the string.
You have not shown where/how it is created.
Also in this line,
if(phrase[i]!=" "){
the code is comparing a char: (phrase[i]) with a string: ( " " )
// char string
if(phrase[i] != " " ){
change it to:
// char char
if(phrase[i] != ' '){
//or better yet, treat any whitespace as a separator (needs <ctype.h>):
if(!isspace((unsigned char)phrase[i])) {
There is no error checking in the following, but it is basically your code with modifications. Read comments for explanation on edits to fgets() usage, casting return of malloc(), how and when to terminate the output buffer mot, etc.:
This performs the following: read a file containing words separated by blank spaces and then prints words one by one.
int main(void)
{
int i = 0;
int m = 0;
char* phrase=malloc(LONGMAX);//sizeof(char) is always 1, so it can be omitted
if(phrase)//test to make sure memory created
{
char* mot=malloc(TAILLE);//no need to cast the return of malloc in C
if(mot)//test to make sure memory created
{
FILE* fp=NULL;
fp=fopen("_in.txt","r");
if(fp)//test to make sure fopen worked
{//shortcut of what you had :) (left off the print err)
i = 0;
m = 0;
while (fgets(phrase,LONGMAX,fp))//fgets return NULL when no more to read.
{
i = 0;//restart at the beginning of each new line
while(phrase[i] != '\0')//test for end of last line read
{
// if(phrase[i] == ' ')//see a space, terminate word and write to stdout
if(isspace((unsigned char)phrase[i]))//see ANY white space, terminate and write to stdout
{
mot[m]=0;//null terminate
if(strlen(mot) > 0) printf("%s\n",mot);
i++;//move to next char in phrase.
m=0;//reset to capture next word
}
else
{
mot[m] = phrase[i];//copy next char into mot
m++;//increment both buffers
i++;// "
}
}
mot[m]=0;//null terminate after while loop
}
//per comment about last word. Print it out here, if there is one.
mot[m]=0;
if(m > 0) printf("%s\n",mot);
fclose(fp);
}
free(mot);
}
free(phrase);
}
return 0;
}
phrase[i]!=" "
You compare character (phrase[i]) and string (" "). If you want to compare phrase[i] with space character, use ' ' instead.
If you want to compare string, use strcmp.
printf("%s\n",phrase[i]);
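Here you use %s, which expects a string, but phrase[i] is a character. Use %c for a single character, or pass mot to print the collected word.
Do not use mot = ""; to clear a string in C: it makes mot point at a string literal (and loses the malloc'ed buffer) instead of emptying the buffer. Use strcpy instead:
strcpy(mot, "");
If you want to print one line word by word, you can use strtok to split the string on the space character.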
fgets(phrase,LONGMAX,fp);
char * token = strtok(phrase, " ");
while(token != NULL) {
printf("%s \n", token);
token = strtok(NULL, " ");
}
Off topic: your program reads only one line of the file because it calls fgets only once. If your file contains many lines, call fgets in a loop.
while(fgets(phrase,LONGMAX,fp)) {
// do something with the phrase string.
// strtok for example.
char * token = strtok(phrase, " ");
while(token != NULL) {
printf("%s \n", token);
token = strtok(NULL, " ");
}
}
Your program has multiple problems:
the test for end of file is incorrect: you should just compare the return value of fgets() with NULL.
the test for spaces is incorrect: phrase[i] != " " is a type mismatch as you are comparing a character with a pointer. You should use isspace() from <ctype.h>
Here is a much simpler alternative that reads one byte at a time, with neither a line buffer nor a word buffer:
#include <ctype.h>
#include <stdio.h>
int main() {
int inword = 0;
int c;
while ((c = getchar()) != EOF) {
if (isspace(c)) {
if (inword) {
putchar('\n');
inword = 0;
}
} else {
putchar(c);
inword = 1;
}
}
if (inword) {
putchar('\n');
}
return 0;
}

parsing a file while reading in c

I am trying to read each line of a file and store binary values into appropriate variables.
I can see that there are many many other examples of people doing similar things and I have spent two days testing out different approaches that I found but still having difficulties getting my version to work as needed.
I have a txt file with the following format:
in = 00000000000, out = 0000000000000000
in = 00000000001, out = 0000000000001111
in = 00000000010, out = 0000000000110011
......
I'm attempting to use fscanf to consume the unwanted characters "in = ", "," and "out = "
and keep only the characters that represent binary values.
My goal is to store the first column of binary values, the "in" values into one variable
and the second column of binary values, the "out" value into another buffer variable.
I have managed to get fscanf to consume the "in" and "out" characters but I have not been
able to figure out how to get it to consume the "," "=" characters. Additionally, I thought that fscanf should consume the white space but it doesn't appear to be doing that either.
I can't seem to find any comprehensive list of available directives for scanners, other than the generic "%d, %s, %c....." and it seems that I need a more complex combination of directives to filter out the characters that I'm trying to ignore than I know how to format.
I could use some help with figuring this out. I would appreciate any guidance you could
provide to help me understand how to properly filter out "in = " and ", out = " and how to store
the two columns of binary characters into two separate variables.
Here is the code I am working with at the moment. I have tried other iterations of this code using fgetc() in combination with fscanf() without success.
int main()
{
FILE * f = fopen("hamming_demo.txt","r");
char buffer[100];
rewind(f);
while((fscanf(f, "%s", buffer)) != EOF) {
fscanf(f,"%[^a-z]""[^,]", buffer);
printf("%s\n", buffer);
}
printf("\n");
return 0;
}
The outputs from my code appear as follows:
= 00000000000,
= 0000000000000000
= 00000000001,
= 0000000000001111
= 00000000010,
= 0000000000110011
Thank you for your time.
The scanf family of functions is said to be a poor man's parser because it is not very tolerant of input errors. But if you are sure of the format of the input data, it allows for simple code. The only magic here is that a space in the format string matches any run of white space characters, including newlines, or none at all. Your code could become:
int main()
{
FILE * f = fopen("hamming_demo.txt", "r");
if (NULL == f) { // always test open
perror("Unable to open input file");
return 1;
}
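char in[50], out[50]; // directly get in and out
// BEWARE: fscanf returns the number of converted items (EOF only if input ends
// before the first conversion), so compare against the expected count.
// %49[01] keeps each read within the 50-byte buffers.
while (fscanf(f, " in = %49[01], out = %49[01]", in, out) == 2) {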
printf("%s - %s\n", in, out);
}
printf("\n");
return 0;
}
So basically you want to filter '0' and '1'? In this case fgets and a simple loop will be enough: copy the 0's and 1's to the front of the buffer and null-terminate the string at the end (note that this prints both columns joined together):
#include <stdio.h>
int main(void)
{
char str[50];
char *ptr;
// Replace stdin with your file
while ((ptr = fgets(str, sizeof str, stdin)))
{
int count = 0;
while (*ptr != '\0')
{
if ((*ptr >= '0') && (*ptr <= '1'))
{
str[count++] = *ptr;
}
ptr++;
}
str[count] = '\0';
puts(str);
}
}

C Reading a file of digits separated by commas

I am trying to read in a file that contains digits separated by commas and store them in an array without the commas present.
For example: processes.txt contains
0,1,3
1,0,5
2,9,8
3,10,6
And an array called numbers should look like:
0 1 3 1 0 5 2 9 8 3 10 6
The code I had so far is:
FILE *fp1;
char c; //declaration of characters
fp1=fopen(argv[1],"r"); //opening the file
int list[300];
c=fgetc(fp1); //taking character from fp1 pointer or file
int i=0,number,num=0;
while(c!=EOF){ //iterate until end of file
if (isdigit(c)){ //if it is digit
sscanf(&c,"%d",&number); //changing character to number (c)
num=(num*10)+number;
}
else if (c==',' || c=='\n') { //if it is new line or ,then it will store the number in list
list[i]=num;
num=0;
i++;
}
c=fgetc(fp1);
}
But this is having problems if it is a double digit. Does anyone have a better solution? Thank you!
For the data shown with no space before the commas, you could simply use:
while (i < 300 && fscanf(fp1, "%d,", &num) == 1)
list[i++] = num;
This will read the comma after the number if there is one, silently ignoring when there isn't one. If there might be white space before the commas in the data, add a blank before the comma in the format string. The test on i prevents you writing outside the bounds of the list array. The ++ operator comes into its own here.
First, fgetc returns an int, so c needs to be an int.
Other than that, I would use a slightly different approach. I admit that it is slightly overcomplicated. However, this approach may be usable if you have several different types of fields that require different actions, as in a parser. For your specific problem, I recommend Jonathan Leffler's answer.
int c=fgetc(f);
while(c!=EOF && i<300) {
if(isdigit(c)) {
fseek(f, -1, SEEK_CUR);
if(fscanf(f, "%d", &list[i++]) != 1) {
// Handle error
}
}
c=fgetc(f);
}
Here I don't care about commas and newlines. I take ANYTHING other than a digit as a separator. What I do is basically this:
read next byte
if byte is digit:
back one byte in the file
read number, regardless of length
else continue
The added condition i<300 is for safety: it keeps the code from writing past the end of list. If you really want to check that nothing other than commas and newlines appears between the numbers (I did not get the impression that you found that important), you could easily add an else if (c == ... to handle the error.
Note that you should always check the return value for functions like sscanf, fscanf, scanf etc. Actually, you should also do that for fseek. In this situation it's not as important since this code is very unlikely to fail for that reason, so I left it out for readability. But in production code you SHOULD check it.
My solution is to read the whole line first and then parse it with strtok_r with comma as a delimiter. If you want portable code you should use strtok instead.
A naive implementation of readline would be something like this:
static char *readline(FILE *file)
{
char *line = malloc(sizeof(char));
if (line == NULL) {
return NULL;
}
int index = 0;
int c = fgetc(file);
if (c == EOF) {
free(line);
return NULL;
}
while (c != EOF && c != '\n') {
line[index++] = c;
char *l = realloc(line, (index + 1) * sizeof(char));
if (l == NULL) {
free(line);
return NULL;
}
line = l;
c = fgetc(file);
}
line[index] = '\0';
return line;
}
Then you just need to parse the whole line with strtok_r, so you would end with something like this:
int main(int argc, char **argv)
{
FILE *file = fopen(argv[1], "re");
int list[300];
if (file == NULL) {
return 1;
}
char *line;
int numc = 0;
while((line = readline(file)) != NULL) {
char *saveptr;
// Get the first token
char *tok = strtok_r(line, ",", &saveptr);
// Now start parsing the whole line
while (tok != NULL) {
// Convert the token to a long if possible
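errno = 0; // strtol never clears errno, so reset it first (needs <errno.h>)
long num = strtol(tok, NULL, 10); // base 10: base 0 would read "010" as octal
if (errno != 0) {
// Handle no value conversion
// ...
// ...
}
if (numc < 300) // stay within the bounds of list
list[numc++] = (int) num;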
// Get next token
tok = strtok_r(NULL, ",", &saveptr);
}
free(line);
}
fclose(file);
return 0;
}
And for printing the whole list just use a for loop:
for (int i = 0; i < numc; i++) {
printf("%d ", list[i]);
}
printf("\n");

Extract character from a string bug

I read one word from a file into a temp variable, e.g. "and". However, when I extract the first character, temp[0], the program crashes at run time. I have tried breakpoints, and the crash is on this line.
This is what happens when I run the code: http://prntscr.com/2vzkmp
These are the words when I don't try to extract a letter: http://prntscr.com/2vzktn
This is the error when I use breakpoints: http://prntscr.com/2vzlr3
This is the line that is messing up: " printf("\n%s \n",temp[0]);"
Here is the code:
int main(void)
{
char **dictmat;
char temp[100];
int i = 0, comp, file, found = 0, j = 0, foundmiss = 0;
FILE* input;
dictmat = ReadDict();
/*opens the text file*/
input = fopen("y:\\textfile.txt", "r");
/*checks if we can open the file, otherwise output error message*/
if (input == NULL)
{
printf("Could not open textfile.txt for reading \n");
}
else
{
/*allocates the memory location to the rows using a for loop*/
do
{
/*temp_line is now the contents of the line in the file*/
file = fscanf(input, "%s", temp);
if (file != EOF)
{
lowercase_remove_punct(temp, temp);
for (i = 0; i < 1000; i++)
{
comp = strcmp(temp, dictmat[i]);
if (comp == 0)
{
/*it has found the word in the dictionary*/
found = 1;
}
}
/*it has not found a word in the dictionay, so the word must be misspelt*/
if (found == 0 && (strcmp(temp, "") !=0))
{
/*temp is the variable that is misspelt*/
printf("\n%s \n",temp[0]);
/*checks for a difference of one letter*/
//one_let(temp);
}
found = 0;
foundmiss = 0;
}
} while (file != EOF);
/*closes the file*/
fclose(input);
}
free_matrix(dictmat);
return 0;
}
When printing a character, use %c, not %s.
There is a fundamental difference between the two. The latter is for strings.
When printf encounters %c, it writes the single character held in the corresponding argument to the output stream.
When it sees %s, it interprets the argument as a pointer to char and copies bytes starting at that address until it reaches a byte that contains zero.
print char - not string:
printf("\n%c \n",temp[0]);
temp[0] is a character. Thus if you are using
printf("\n%s \n",temp[0]);
printf treats the value of temp[0] as an address and tries to print the string stored there. That address is almost certainly not accessible, so the program crashes.
Change it to
printf("\n%c \n",temp[0]);
Why are you using %s as the conversion specifier? Use %c.
