Why does this bison code produce unexpected output? - c

flex code:
%option noyywrap nodefault yylineno case-insensitive
%{
#include "stdio.h"
#include "tp.tab.h"
%}

%%
"{" {return '{';}
"}" {return '}';}
";" {return ';';}
"create" {return CREATE;}
"cmd" {return CMD;}
"int" {yylval.intval = 20;return INT;}
[a-zA-Z]+ {yylval.strval = yytext;printf("id:%s\n" , yylval.strval);return ID;}
[ \t\n]
<<EOF>> {return 0;}
. {printf("mistery char\n");}
bison code:
1 %{
2 #include "stdlib.h"
3 #include "stdio.h"
4 #include "stdarg.h"
5 void yyerror(char *s, ...);
6 #define YYDEBUG 1
7 int yydebug = 1;
8 %}
9
10 %union{
11 char *strval;
12 int intval;
13 }
14
15 %token <strval> ID
16 %token <intval> INT
17 %token CREATE
18 %token CMD
19
20 %type <strval> col_definition
21 %type <intval> create_type
22 %start stmt_list
23
24 %%
25 stmt_list:stmt ';'
26 | stmt_list stmt ';'
27 ;
28
29 stmt:create_cmd_stmt {/*printf("create cmd\n");*/}
30 ;
31
32 create_cmd_stmt:CREATE CMD ID'{'create_col_list'}' {printf("%s\n" , $3);}
33 ;
34 create_col_list:col_definition
35 | create_col_list col_definition
36 ;
37
38 col_definition:create_type ID ';' {printf("%d , %s\n" , $1, $2);}
39 ;
40
41 create_type:INT {$$ = $1;}
42 ;
43
44 %%
45 extern FILE *yyin;
46
47 void
48 yyerror(char *s, ...)
49 {
50 extern yylineno;
51 va_list ap;
52 va_start(ap, s);
53 fprintf(stderr, "%d: error: ", yylineno);
54 vfprintf(stderr, s, ap);
55 fprintf(stderr, "\n");
56 }
57
58 int main(int argc , char *argv[])
59 {
60 yyin = fopen(argv[1] , "r");
61 if(!yyin){
62 printf("open file %s failed\n" ,argv[1]);
63 return -1;
64 }
65
66 if(!yyparse()){
67 printf("parse work!\n");
68 }else{
69 printf("parse failed!\n");
70 }
71
72 fclose(yyin);
73 return 0;
74 }
75
test input file:
create cmd keeplive
{
int a;
int b;
};
test output:
root#VM-Ubuntu203001:~/test/tpp# ./a.out t1.tp
id:keeplive
id:a
20 , a;
id:b
20 , b;
keeplive
{
int a;
int b;
}
parse work!
I have two questions:
1) Why does the action at line 38 print the token ';'? For instance, "20 , a;" and "20 , b;"
2) Why does the action at line 32 print "keeplive
{
int a;
int b;
}" instead of simply "keeplive"?

Short answer:
yylval.strval = yytext;
You can't use yytext like that. The string it points to is private to the lexer and will change as soon as the flex action finishes. You need to do something like:
yylval.strval = strdup(yytext);
and then you need to make sure you free the memory afterwards. (strdup is declared in <string.h>, which your flex file does not include yet.)
Longer answer:
yytext is actually a pointer into the buffer containing the input. To make yytext work as though it were a NUL-terminated string, the flex framework overwrites the character following the token with a NUL before it runs the action, and puts the original character back before it scans the next token. So strdup will work fine inside the action, but outside the action (in your bison code) you have a pointer into the buffer that starts at the token and runs on to wherever the most recent NUL happens to be, which is why your output shows "a;" and the whole block after "keeplive". And it gets worse later, since flex will read the next part of the source into the same buffer, and then your pointer is to random garbage. There are several possible scenarios, depending on flex options, but none of them are pretty.
So the golden rule: yytext is only valid until the end of the action. If you want to keep it, copy it, and then make sure you free the storage for the copy when you no longer need it.
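The effect can be reproduced without flex. The following is only a toy model, not flex's real code: the names toy_scanner, toy_next and copy_token are made up for illustration, and it assumes the in-place NUL is undone when the next token is scanned.

```c
#include <stdlib.h>
#include <string.h>

/* Toy scanner, not flex itself: it mimics how a token is
 * NUL-terminated in place inside the input buffer. */
struct toy_scanner {
    char *buf;    /* whole input, modified in place */
    size_t pos;   /* where the next token starts */
    char hold;    /* character overwritten by the NUL */
    int patched;  /* nonzero while a NUL patch is in place */
};

/* Return a pointer to the next len-byte token, terminated in place.
 * The previous token's terminator is undone first. */
char *toy_next(struct toy_scanner *s, size_t len)
{
    char *text;

    if (s->patched)
        s->buf[s->pos] = s->hold;
    text = s->buf + s->pos;
    s->pos += len;
    s->hold = s->buf[s->pos];
    s->buf[s->pos] = '\0';
    s->patched = 1;
    return text;
}

/* Private copy of a token, playing the role of strdup(yytext).
 * Returns NULL if allocation fails; the caller frees the result. */
char *copy_token(const char *text)
{
    size_t n = strlen(text) + 1;
    char *copy = malloc(n);

    if (copy != NULL)
        memcpy(copy, text, n);
    return copy;
}
```

With the input "a;b", a pointer saved from the first toy_next(&s, 1) call reads "a" at first and "a;" once the next token has been scanned, which mirrors the "20 , a;" line in the question. The string returned by copy_token stays "a".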
In almost all the lexers I've written, the ID token actually finds the identifier in a symbol table (or puts it there) and returns a pointer into the symbol table, which simplifies memory management. But you still have essentially the same memory management issue with, for example, character string literals.
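A minimal sketch of that symbol-table idea, assuming a small fixed-size table with linear search (the name intern and the limit MAX_SYMBOLS are hypothetical; a real lexer would more likely use a hash table):

```c
#include <stdlib.h>
#include <string.h>

#define MAX_SYMBOLS 256   /* arbitrary limit for this sketch */

static char *symbols[MAX_SYMBOLS];
static size_t symbol_count;

/* Return the table's own copy of name, adding it on first sight.
 * Returns NULL if the table is full or allocation fails. */
const char *intern(const char *name)
{
    size_t i;
    size_t len;
    char *copy;

    for (i = 0; i < symbol_count; i++)
        if (strcmp(symbols[i], name) == 0)
            return symbols[i];
    if (symbol_count == MAX_SYMBOLS)
        return NULL;
    len = strlen(name) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return NULL;
    memcpy(copy, name, len);
    symbols[symbol_count++] = copy;
    return copy;
}
```

The flex action would then store intern(yytext) in yylval.strval. The table owns every string, so the parser never has to free an identifier, and two occurrences of the same identifier compare equal by pointer.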

How to install and include a library to my C compiler?

I am using Debian 11
I am trying to reproduce strlcpy.
1 #include <stdio.h>
2 #include <string.h>
3 #include <stdlib.h>
4
5 unsigned int ft_strlcpy(char *dest, char *src, unsigned int size)
6 {
7 unsigned int i;
8
9 i = 0;
10 while (src[i] && i < size)
11 {
12 dest[i] = src[i];
13 i++;
14 }
15 dest[i] = '\0';
16 while (src[i])
17 i++;
18 return (i);
19 }
20
21 int main()
22 {
23 unsigned int i;
24 char *dest1 = malloc(sizeof(char) * 50);
25 char *dest2 = malloc(sizeof(char) * 50);
26
27 i = 0;
28 while (i < 26)
29 {
30 printf("%d ", ft_strlcpy(dest1, "hello my name is marcel", i));
31 printf("%s\n", dest1);
32 printf("%ld ", strlcpy(dest2, "hello my name is marcel", i));
33 printf("%s\n", dest2);
34 i++;
35 }
36 free(dest1);
37 free(dest2);
38 return (0);
39 }
However, I get this message when I compile my code:
ft_strlcpy.c: In function ‘main’:
ft_strlcpy.c:32:18: warning: implicit declaration of function ‘strlcpy’; did you mean ‘strncpy’? [-Wimplicit-function-declaration]
32 | printf("%ld ", strlcpy(dest2, "hello my name is marcel", i));
| ^~~~~~~
| strncpy
/usr/bin/ld: /tmp/ccukR8g6.o: in function `main':
ft_strlcpy.c:(.text+0xf0): undefined reference to `strlcpy'
collect2: error: ld returned 1 exit status
make: *** [<builtin>: ft_strlcpy] Error 1
I have no idea how to include libbsd or use pkgconf.
I have tried for a couple of hours, but I couldn't find the solution.
If someone could redirect me to a manual or explain the concepts, that would be great.
Thank you for your help!
The strlcpy function comes from the BSD libc, a superset (extended version) of the POSIX standard library used on BSD operating systems. For the compiler to recognize it, first install the library through your package manager; the package is named libbsd, libbsd-dev or libbsd-devel, depending on whether your distribution ships separate development packages (on Debian it is libbsd-dev). Then include it as <bsd/string.h> and compile with (assuming you use GCC) gcc <your-filename>.c -lbsd, which tells the linker to link against the library. I wouldn't recommend using BSD functions outside of BSD-specific software because of portability issues (POSIX non-compliance).
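If linking against libbsd is not an option, a portable stand-in can be written in a few lines. This is a sketch under the documented strlcpy contract, not the libbsd source, and the name my_strlcpy is made up here. Note that the ft_strlcpy in the question copies up to size bytes and then writes the terminator at dest[size], one byte past the limit; the reference behaviour copies at most size - 1 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper name. Copies at most size - 1 bytes, always
 * terminates dest when size > 0, and returns strlen(src) so that
 * callers can detect truncation (return value >= size). */
size_t my_strlcpy(char *dest, const char *src, size_t size)
{
    size_t src_len = strlen(src);

    if (size > 0) {
        size_t n = (src_len < size - 1) ? src_len : size - 1;

        memcpy(dest, src, n);
        dest[n] = '\0';
    }
    return src_len;
}
```

Because the return type is size_t, print it with %zu rather than %d or %ld.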

Printing wide character literals in C

I am trying to print unicode to the terminal under linux using the wchar_t type defined in the wchar.h header. I have tried the following:
#include <wchar.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
/*
char* direct = "\xc2\xb5";
fprintf(stderr, "%s\n", direct);
*/
wchar_t* dir_lit = L"μ";
wchar_t* uni_lit = L"\u03BC";
wchar_t* hex_lit = L"\xc2\xb5";
fwprintf(stderr,
L"direct: %ls, unicode: %ls, hex: %ls\n",
dir_lit,
uni_lit,
hex_lit);
return 0;
}
and compiled it using gcc -O0 -g -std=c11 -o main main.c.
This produces the output direct: m, unicode: m, hex: ?u (based on a terminal with LANG=en_US.UTF-8). In hex:
00000000 64 69 72 65 63 74 3a 20 6d 2c 20 75 6e 69 63 6f |direct: m, unico|
00000010 64 65 3a 20 6d 2c 20 68 65 78 3a 20 3f 75 0a |de: m, hex: ?u.|
0000001f
The only way that I have managed to obtain the desired output of μ is via the code commented out above (as a char* consisting of hex escapes).
I have also tried to print using the wcstombs function:
void print_wcstombs(wchar_t* str)
{
char buffer[100];
wcstombs(buffer, str, sizeof(buffer));
fprintf(stderr, "%s\n", buffer);
}
If I call, for example, print_wcstombs(dir_lit), then nothing is printed at all, so this approach does not seem to work either.
I would be content with the hex digit solution in principle; however, the length of the string is not calculated correctly (it should be one, but is two bytes long), so formatting via printf does not work correctly.
Is there any way to handle / print unicode literals the way I intend using the wchar_t type?
With your program as-is, I compiled and ran it to get
direct: ?, unicode: ?, hex: ?u
I then included <locale.h> and added a setlocale(LC_CTYPE, ""); at the very beginning of the main() function, which, when run using a Unicode locale (LANG=en_US.UTF-8), produces
direct: μ, unicode: μ, hex: Âµ
(Codepoint 0xC2 is Â in Unicode and 0xB5 is µ (U+00B5 MICRO SIGN as opposed to U+03BC GREEK SMALL LETTER MU); hence the characters seen for the 'hex' output. Results might vary in an environment that does not use Unicode for wide characters.)
Basically, to output wide characters you need to set the ctype locale so the stdio system knows how to convert them to the multibyte ones expected by the underlying system.
The updated program:
#include <wchar.h>
#include <stdio.h>
#include <locale.h>
int main(int argc, char *argv[])
{
setlocale(LC_CTYPE, "");
wchar_t* dir_lit = L"μ";
wchar_t* uni_lit = L"\u03BC";
wchar_t* hex_lit = L"\xc2\xb5";
fwprintf(stderr,
L"direct: %ls, unicode: %ls, hex: %ls\n",
dir_lit,
uni_lit,
hex_lit);
return 0;
}
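The same locale requirement applies to the wcstombs attempt from the question. Below is a hedged sketch of a checked conversion; the helper name wide_to_multibyte is hypothetical. In the default "C" locale wcstombs typically fails on μ and returns (size_t)-1, which would be consistent with the question's print_wcstombs printing nothing useful.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to the current locale's
 * multibyte encoding. Returns the number of bytes written (excluding
 * the NUL), or (size_t)-1 if a character cannot be represented or
 * out_size is 0. */
size_t wide_to_multibyte(const wchar_t *str, char *out, size_t out_size)
{
    size_t n;

    if (out_size == 0)
        return (size_t)-1;
    n = wcstombs(out, str, out_size - 1);
    if (n == (size_t)-1) {
        out[0] = '\0';   /* leave a valid empty string on failure */
        return n;
    }
    out[n] = '\0';       /* wcstombs does not terminate a full buffer */
    return n;
}
```

Call setlocale(LC_CTYPE, "") once at the start of main before using it. In a UTF-8 locale, L"\u03BC" should then convert to two bytes, and the result can be printed with an ordinary %s.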

using Makefiles in c, declaring variables multiple definitions

I am using makefiles in C. I have three files, and each of the three has declarations of all the variables, so I get this when I compile:
/usr/bin/ld: comp_disc.o:(.bss+0x8): multiple definition of `Cost_of_purchase'; main.o:(.bss+0x8): first defined here
/usr/bin/ld: comp_disc.o:(.bss+0x10): multiple definition of `DiscTot'; main.o:(.bss+0x10): first defined here
/usr/bin/ld: comp_disc.o:(.bss+0x18): multiple definition of `Sales_tax'; main.o:(.bss+0x18): first defined here
/usr/bin/ld: comp_disc.o:(.bss+0x20): multiple definition of `Total_price'; main.o:(.bss+0x20): first defined here
/usr/bin/ld: comp_disc.o:(.bss+0x28): multiple definition of `military'; main.o:(.bss+0x28): first defined here
But when I keep those declarations only in main.c, I get this:
comp_disc.c:10:12: error: ‘Cost_of_purchase’ undeclared (first use in this function)
10 | if(Cost_of_purchase > 150) {
| ^~~~~~~~~~~~~~~~
comp_disc.c:11:13: error: ‘Mdisc’ undeclared (first use in this function)
11 | Mdisc = .15 * Cost_of_purchase;
So I'm wondering what I need to do so that my variables are declared correctly using make.
Here is my makefile:
# target : dependencies
cwork7 : main.o comp_disc.o print_res.o
	gcc main.o comp_disc.o print_res.o -Wall -o cwork7

main.o : main.c
	gcc -c main.c -Wall

comp_disc : comp_disc.c
	gcc -c comp_disc.c -Wall

print_res.o : print_res.c
	gcc -c print_res.c -Wall
my main.c
5 #include <stdio.h>
6 //functions prototypes
7 void compute_discount(void);
8 int print_results(void);
9
10
11 //defined Gloabal var
12 double Mdisc;
13 double Cost_of_purchase;
14 double DiscTot;
15 double Sales_tax;
16 double Total_price;
17 char military;
18
19 int main (void) {
20 //declare variables
21
22 //Cost of purchase
23 printf("Cost of purchase?\t\t$");
24 scanf ("%lf",&Cost_of_purchase);
25
26 //Military?
27 printf("In military (y or n)?\t\t");
28 scanf(" %c" ,&military);
29
30 //calling for functions
31 compute_discount();
32 print_results();
33
34 }
35
36
my print_res.c
1 #include <stdio.h>
2
3 //function to print results
4 int print_results(void){
5
6 //if input is y Y then use below, this is not dependant on if military only if the letter is accepted
7 switch(military){
8 case 'y':
9 case 'Y':
10 printf("Military discount (15%%): \t$%.2f\n", Mdisc);
11 printf("Discounted total: \t\t$%.2f\n", DiscTot);
12 printf("Sales tax (5%%): \t\t$%.2f\n", Sales_tax);
13 printf("Total: \t\t\t\t$%.2f\n", Total_price);
14 break;
15 //less information is given when n or N is used
16 case 'n':
17 case 'N':
18 printf("Sales tax (5%%): \t\t$%.2f\n", Sales_tax);
19 printf("Total: \t\t\t\t$%.2f\n", Total_price);
20 break;
21 }
22 return(0);
23 }
and my comp_disc.c
1 #include <stdio.h>
2
3 //function to compute discount
4 void compute_discount(void){
5
6 //compute military discount
7 switch(military){
8 case 'y':
9 case 'Y':
10 if(Cost_of_purchase > 150) {
11 Mdisc = .15 * Cost_of_purchase;
12 } else if (Cost_of_purchase < 150) {
13 Mdisc = .10 * Cost_of_purchase;
14 }
15 break;
16 case 'n':
17 case 'N':
18 Mdisc = 0;
19 break;
20 default:
21 printf("Error: bad input\n");
22 }
23
24 //cost minus military discount
25 DiscTot = Cost_of_purchase - Mdisc;
26 //sales tax
27 Sales_tax = .05 * DiscTot;
28 //Total Calculated
29 Total_price = DiscTot + Sales_tax;
30
31 }
Please let me know what you think is the issue.
This has nothing to do with the Makefile.
If you define the variables in every source file, you get exactly what the linker says: multiple definitions of the same name. And if you drop them from a file, you obviously get a compile error, because you are using variables the compiler does not know about.
The simple solution is to keep the definitions in main.c as-is and declare the variables as extern in all other files, like extern double Cost_of_purchase; That tells the compiler the variable exists but is defined elsewhere, which solves the problem.
A better approach, however, is to avoid global variables altogether and pass your data to the functions.
struct acc_data {
double Mdisc;
double Cost_of_purchase;
double DiscTot;
double Sales_tax;
double Total_price;
char military;
};
void compute_discount(struct acc_data *acc);
int print_results(struct acc_data *acc);
int main(void)
{
struct acc_data acc = { 0 };
// init code skipped
compute_discount(&acc);
print_results(&acc);
}
void compute_discount(struct acc_data *acc)
{
//same as before but prefix variables with acc->
// example:
acc->Total_price = 5000.0;
}
That gets rid of your original problem and improves your code somewhat.
Put the definition of the struct in a header file you include in all C files that use it.
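As a rough sketch of how that split could look, shown here as a single translation unit so it compiles on its own: the file name acc_data.h is hypothetical, and the discount rule is simplified from the question's code (exactly 150 is treated as the 10% tier, and any input other than y or Y means no discount).

```c
/* ---- acc_data.h (hypothetical file name) ---- */
#ifndef ACC_DATA_H
#define ACC_DATA_H

struct acc_data {
    double Mdisc;
    double Cost_of_purchase;
    double DiscTot;
    double Sales_tax;
    double Total_price;
    char military;
};

void compute_discount(struct acc_data *acc);

#endif /* ACC_DATA_H */

/* ---- comp_disc.c: would start with #include "acc_data.h" ---- */
void compute_discount(struct acc_data *acc)
{
    double rate = 0.0;

    /* Assumption: 15% above 150, otherwise 10%; the original code
     * leaves a purchase of exactly 150 unhandled. */
    if (acc->military == 'y' || acc->military == 'Y')
        rate = (acc->Cost_of_purchase > 150) ? 0.15 : 0.10;

    acc->Mdisc = rate * acc->Cost_of_purchase;
    acc->DiscTot = acc->Cost_of_purchase - acc->Mdisc;
    acc->Sales_tax = 0.05 * acc->DiscTot;
    acc->Total_price = acc->DiscTot + acc->Sales_tax;
}
```

main.c and print_res.c would include the same header, so the struct layout and the prototypes are written exactly once and no extern declarations are needed.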

krb-protos.h Syntax error: possible missing identifier?

I am using krb-protos.h from the krb4 package to compile my OpenSSH code on AIX 7.1, but I am getting the errors below:
"/usr/athena/include/krb-protos.h", line 71.9: 1506-046 (S) Syntax error.
"/usr/athena/include/krb-protos.h", line 75.15: 1506-275 (S) Unexpected text c encountered.
"/usr/athena/include/krb-protos.h", line 76.9: 1506-276 (S) Syntax error: possible missing identifier?
"/usr/athena/include/krb-protos.h", line 82.9: 1506-335 (S) Parameter identifier list contains multiple occurrences of KTEXT.
"/usr/athena/include/krb-protos.h", line 74.1: 1506-282 (S) The type of the parameters must be specified in a prototype.
"/usr/athena/include/krb-protos.h", line 88.15: 1506-275 (S) Unexpected text pkt encountered.
"/usr/athena/include/krb-protos.h", line 89.9: 1506-276 (S) Syntax error: possible missing identifier?
"/usr/athena/include/krb-protos.h", line 87.1: 1506-282 (S) The type of the parameters must be specified in a prototype.
"/usr/athena/include/krb-protos.h", line 98.15: 1506-275 (S) Unexpected text tkt encountered.
"/usr/athena/include/krb-protos.h", line 99.9: 1506-276 (S) Syntax error: possible missing identifier?
snap of krb-protos.h file :
. . .
68
69 void KRB_LIB_FUNCTION
70 afs_string_to_key __P((
71 const char *str,
72 const char *cell,
73 des_cblock *key));
74
75 int KRB_LIB_FUNCTION
76 create_ciph __P((
77 KTEXT c,
78 unsigned char *session,
79 char *service,
80 char *instance,
81 char *realm,
82 u_int32_t life,
83 int kvno,
84 KTEXT tkt,
85 u_int32_t kdc_time,
86 des_cblock *key));
87
88 int KRB_LIB_FUNCTION
89 cr_err_reply __P((
90 KTEXT pkt,
91 char *pname,
92 char *pinst,
93 char *prealm,
94 u_int32_t time_ws,
95 u_int32_t e,
96 char *e_string));
97
. . .
I am using the xlc compiler on AIX. Any idea what the issue could be?
xlc compiler with options :
xlc -DAFS -DKRB4 -L/usr/athena/lib -I/usr/athena/include -DSTATIC_AFS_SYSCALLS=1 -DOPENSSL_DES_LIBDES_COMPATIBILITY=1 -bloadmap -I. -I. -I/usr/include -I/usr/include/gssapi -I/usr/include/gssapi -DSSHDIR=\"/etc/ssh\" -D_PATH_SSH_PROGRAM=\"/opt/freeware/bin/ssh\" -D_PATH_SSH_ASKPASS_DEFAULT=\"/opt/freeware/libexec/openssh/ssh-askpass\" -D_PATH_SFTP_SERVER=\"/opt/freeware/libexec/openssh/sftp-server\" -D_PATH_SSH_KEY_SIGN=\"/opt/freeware/libexec/openssh/ssh-keysign\" -D_PATH_SSH_PKCS11_HELPER=\"/opt/freeware/libexec/openssh/ssh-pkcs11-helper\" -D_PATH_SSH_PIDDIR=\"/var/run\" -D_PATH_PRIVSEP_CHROOT_DIR=\"/var/empty\" -DHAVE_CONFIG_H -c ssh_api.c -o ssh_api.o

Unescape a universal character name to the corresponding character in C

NEW EDIT:
Basically I've provided an example that isn't correct. In my real application the string will of course not always be "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt". Instead I will have an input window in Java, and I will "escape" the Unicode characters to universal character names. They will then be "unescaped" in C (I do this to avoid problems with passing multibyte characters from Java to C). So here is an example where I actually ask the user to input a string (filename):
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char src[100];
scanf("%99s", src);
printf("%s\n", src);
int exists = func((const char*) src);
printf("Does the file exist? %d\n", exists);
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
And now it will think the universal character names are just part of the actual filename. So how do I "unescape" the universal character names included in the input?
FIRST EDIT:
So I compile this example like this: "gcc -std=c99 read.c" where 'read.c' is my source file. I need the -std=c99 parameter because I'm using the prefix '\u' for my universal character name. If I change it to '\x' it works fine, and I can remove the -std=c99 parameter. But in my real application the input will not use the prefix '\x' instead it will be using the prefix '\u'. So how do I work around this?
This code gives the desired result but for my real application I can't really use '\x':
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char *src = "C:/Users/Familjen-Styren/Documents/V\x00E5gformer/20140104-0002/text.txt";
int exists = func((const char*) src);
printf("Does the file exist? %d\n", exists);
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
ORIGINAL:
I've found a few examples of how to do this in other programming languages like javascript but I couldn't find any example on how to do this in C. Here is a sample code which produces the same error:
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char *src = "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt";
int len = strlen(src); /* This returns 68. */
char fname[len];
sprintf(fname,"%s", src);
int exists = func((const char*) src);
printf("%s\n", fname);
printf("Does the file exist? %d\n", exists); /* Outputs 'Does the file exist? 0' which means it doesn't exist. */
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
If I instead use the same string without universal character names:
#include <stdio.h>
#include <string.h>
int func(const char *fname);
int main()
{
char *src = "C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text.txt";
int exists = func((const char*) src);
printf("Does the file exist? %d\n", exists); /* Outputs 'Does the file exist? 1' which means it does exist. */
return exists;
}
int func(const char *fname)
{
FILE *file;
if (file = fopen(fname, "r"))
{
fclose(file);
return 1;
}
return 0;
}
it will output 'Does the file exist? 1'. Which means it does indeed exist. But the problem is I need to be able to handle universal character. So how do I unescape a string which contains universal character names?
Thanks in advance.
I'm re-editing the answer in the hope of making it clearer. First of all, I'm assuming you are familiar with this: http://www.joelonsoftware.com/articles/Unicode.html. It is required background knowledge when dealing with character encoding.
Now I'm starting with a simple test program, test.c, which I typed on my Linux machine:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#define BUF_SZ 255
void test_fwrite_universal(const char *fname)
{
printf("test_fwrite_universal on %s\n", fname);
printf("In memory we have %d bytes: ", (int)strlen(fname));
for (unsigned i=0; i<strlen(fname); ++i) {
printf("%x ", (unsigned char)fname[i]);
}
printf("\n");
FILE* file = fopen(fname, "w");
if (file) {
fwrite((const void*)fname, 1, strlen(fname), file);
fclose(file);
file = NULL;
printf("Wrote to file successfully\n");
}
}
int main()
{
test_fwrite_universal("file_\u00e5.txt");
test_fwrite_universal("file_å.txt");
test_fwrite_universal("file_\u0436.txt");
return 0;
}
The source file is encoded as UTF-8. On my Linux machine my locale is en_US.UTF-8.
So I compile and run the program like this:
gcc -std=c99 test.c -fexec-charset=UTF-8 -o test
./test
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_ж.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully
The source file is in UTF-8, my locale is UTF-8, and the execution character set for char is UTF-8.
In main I call the function test_fwrite_universal three times with character strings. The function prints each string byte by byte, then creates a file with that name and writes the string into it.
We can see that "file_\u00e5.txt" and "file_å.txt" are the same: 66 69 6c 65 5f c3 a5 2e 74 78 74
and sure enough (http://www.fileformat.info/info/unicode/char/e5/index.htm) the UTF-8 representation of code point U+00E5 is: c3 a5
In the last example I used \u0436, which is the Cyrillic letter ж (UTF-8 d0 b6).
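In the examples above the compiler translates \u00e5 into c3 a5 at compile time. For input that only arrives at run time, as in the question, the same translation has to be done by hand. Here is a minimal sketch, assuming the escapes always have exactly four hex digits, that surrogate pairs never occur, and that UTF-8 is the wanted output; the names utf8_encode and unescape_ucn are made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Encode one code point up to U+FFFF as UTF-8.
 * Returns the number of bytes written to out (1 to 3). */
size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: replace each \uXXXX in src with its UTF-8
 * bytes. dst must hold at least strlen(src) + 1 bytes; that is
 * enough because a six-character escape shrinks to at most three. */
void unescape_ucn(const char *src, char *dst)
{
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == 'u'
            && isxdigit((unsigned char)src[2])
            && isxdigit((unsigned char)src[3])
            && isxdigit((unsigned char)src[4])
            && isxdigit((unsigned char)src[5])) {
            char hex[5];

            memcpy(hex, src + 2, 4);
            hex[4] = '\0';
            dst += utf8_encode(strtoul(hex, NULL, 16), dst);
            src += 6;
        } else {
            *dst++ = *src++;   /* ordinary byte, copy unchanged */
        }
    }
    *dst = '\0';
}
```

Feeding it the run-time string file_\u00e5.txt (a literal backslash followed by u00e5) yields the bytes 66 69 6c 65 5f c3 a5 2e 74 78 74, the same as the compile-time result above. Whether fopen then accepts those bytes as a file name depends on the platform, as the rest of this answer shows for Windows.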
Now let's try the same on my Windows machine. Here I use mingw and I execute the same code:
C:\test>gcc -std=c99 test.c -fexec-charset=UTF-8 -o test.exe
C:\test>test
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_╨╢.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully
So it looks like something went horribly wrong: printf is not writing the characters properly, and the file names on disk also look wrong.
Two things are worth noting: in terms of byte values, the file name is the same on both Linux and Windows, and the content of the file is also correct when opened with something like Notepad++.
The reason for the problem is the C standard library on Windows and the locale. Where on Linux the system locale is UTF-8, on Windows my default locale is CP-437. When I call functions such as printf and fopen, they assume the input is in CP-437, and there c3 a5 are actually two characters.
Before we look at a proper Windows solution, let's try to explain why you get different results for file_å.txt and file_\u00e5.txt.
I believe the key is the encoding of your source file. If I write the same test.c in CP-437:
C:\test>iconv -f UTF-8 -t cp437 test.c > test_lcl.c
C:\test>gcc -std=c99 test_lcl.c -fexec-charset=UTF-8 -o test_lcl.exe
C:\test>test_lcl
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 10 bytes: 66 69 6c 65 5f 86 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_╨╢.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully
I now get a difference between file_å and file_\u00e5. The character å in the source file is actually encoded as 0x86. Notice that this time the second string is 10 bytes long, not 11.
If we look at the file and tell Notepad++ to use UTF-8, we will see a funny result. The same goes for the actual data written to the file.
Finally, how to get the damn thing working on Windows. Unfortunately, it seems to be impossible to use the standard library with UTF-8 encoded strings: on Windows you can't set the C locale to that. See: What is the Windows equivalent for en_US.UTF-8 locale?
However, we can work around this with wide characters:
#include <stdio.h>
#include <string.h>
#include <windows.h>
#define BUF_SZ 255
void test_fopen_windows(const char *fname)
{
wchar_t buf[BUF_SZ] = {0};
int sz = MultiByteToWideChar(CP_UTF8, 0, fname, strlen(fname), (LPWSTR)buf, BUF_SZ-1);
wprintf(L"converted %d characters\n", sz);
wprintf(L"Converting to wide characters %ls\n", buf);
FILE* file =_wfopen(buf, L"w");
if (file) {
fwrite((const void*)fname, 1, strlen(fname), file);
fclose(file);
wprintf(L"Wrote file %ls successfully\n", buf);
}
}
int main()
{
test_fopen_windows("file_\u00e5.txt");
return 0;
}
To compile use:
gcc -std=gnu99 -fexec-charset=UTF-8 test_wide.c -o test_wide.exe
_wfopen is not ANSI compliant, and -std=c99 defines __STRICT_ANSI__, so you should use gnu99 to have that function declared.
The array size is wrong: it leaves out the .txt and the terminating \0, and it ignores that an encoded non-ASCII character takes up more than one byte.
// length of the string without the universal character name.
// C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text
// 123456789012345678901234567890123456789012345678901234567890123
// 1 2 3 4 5 6
// int len = 63;
// C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text.txt
int len = 100;
char *src = "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt";
char fname[len];
// or size the array from the string itself (a VLA):
// char fname[strlen(src)+1];
sprintf(fname, "%s", src);
