fgetws can't read non-English characters on Linux - c

I have a basic C program that reads some lines from a text file containing hundreds of lines in its working directory. Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <ctype.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>
#include <unistd.h>
int main(int argc, const char * argv[]) {
    srand((unsigned)time(0));
    char *nameFileName = "MaleNames.txt";
    wchar_t line[100];
    wchar_t **nameLines = malloc(sizeof(wchar_t*) * 2000);
    int numNameLines = 0;
    FILE *nameFile = fopen(nameFileName, "r");
    while (fgetws(line, 100, nameFile) != NULL) {
        nameLines[numNameLines] = malloc(sizeof(wchar_t) * 100);
        wcsncpy(nameLines[numNameLines], line, 100);
        numNameLines++;
    }
    fclose(nameFile);
    wchar_t *name = nameLines[rand() % numNameLines];
    name[wcslen(name) - 1] = '\0';
    wprintf(L"%ls", name);
    int i;
    for (i = 0; i < numNameLines; i++) {
        free(nameLines[i]);
    }
    free(nameLines);
    return 0;
}
It basically reads my text file (defined as a macro, it exists in the working directory) line by line. The rest is irrelevant. It runs perfectly and as expected on my Mac (with llvm/Xcode). When I try to compile it (nothing fancy, again, gcc main.c) and run it on a Linux server, it either:
Exits with error code 2 (meaning no lines are read).
Reads only the first 3 lines from my file with hundreds of lines.
What causes this nondeterministic (and incorrect) behavior? I've tried commenting out the first line (random seed) and compiling again; it then always exits with return code 2.
What is the relation between the random methods and reading a file, and why am I getting this behavior?
UPDATE: I've fixed malloc to sizeof(wchar_t) * 100 from sizeof(wchar_t) * 50. It didn't change anything. My lines are about 15 characters at most, and there are much less than 2000 lines (it is guaranteed).
UPDATE 2:
I've compiled with -Wall, no issues.
I've compiled with -Werror, no issues.
I've run valgrind; it didn't find any leaks either.
I've debugged with gdb, it just doesn't enter the while loop (fgetws call returns 0).
UPDATE 3: I'm getting a floating point exception on Linux, as numNameLines is zero.
UPDATE 4: I verify that I have read permissions on MaleNames.txt.
UPDATE 5: I've found that accented, non-English characters (e.g. Â) cause problems while reading lines. fgetws halts on them. I've tried setting locale (both setlocale(LC_ALL, "en.UTF-8"); and setlocale(LC_ALL, "tr.UTF-8"); separately) but didn't work.

fgetws() is attempting to read up to 100 wide characters. The malloc() call in the loop allocates 50 wide characters.
The wcscpy() call copies all the wide characters read. If more than 50 wide characters have been read (including the terminating nul) then wcscpy() will overrun the allocated buffer. That results in undefined behaviour.
Instead of multiplying by 50 in the loop, multiply by 100. Or, better yet, compute the length of string read and use that.
Independently of the above, your code will also overrun a buffer if the file contains more than 2000 lines. Your loop needs to check for that.
A number of the functions in your code can fail, and will return a value to indicate that. Your code is not checking for any such failures.
Your code running under OS X is happenstance. The behaviour is undefined, which means there is potential to fail on any host system, when built with any compiler. Appearing to run correctly on one system, and failing on another system, is actually a valid set of responses to undefined behaviour.
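The checks described above (a cap on the number of lines, allocation sized to what was actually read, and testing return values) could be sketched roughly like this. The names load_lines, MAX_LINES and LINE_LEN are made up for illustration and are not from the question; treat it as one possible shape, not the definitive fix.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

#define MAX_LINES 2000   /* hypothetical cap, same as the question's array size */
#define LINE_LEN  100    /* hypothetical per-line buffer size */

/* Reads up to MAX_LINES lines from path into lines[], stripping the trailing
   newline. Returns the number of lines stored, or -1 if the file could not
   be opened. The caller owns (and must free) each stored line. */
int load_lines(const char *path, wchar_t *lines[MAX_LINES])
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    wchar_t line[LINE_LEN];
    int count = 0;
    while (count < MAX_LINES && fgetws(line, LINE_LEN, fp) != NULL) {
        size_t len = wcslen(line);
        if (len > 0 && line[len - 1] == L'\n')
            line[--len] = L'\0';
        /* Allocate exactly what this line needs, terminator included. */
        lines[count] = malloc((len + 1) * sizeof(wchar_t));
        if (lines[count] == NULL)
            break;
        wcscpy(lines[count], line);
        count++;
    }
    fclose(fp);
    return count;
}
```

The caller should also check that the returned count is greater than zero before computing rand() % count; a count of zero is what produces the floating point exception mentioned in UPDATE 3.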

Found the solution. It was all about the locale, from the beginning. After experimenting and hours of research, I've stumbled upon this: http://cboard.cprogramming.com/c-programming/142780-arrays-accented-characters.html#post1066035
#include <locale.h>
setlocale(LC_ALL, "");
Setting locale to empty string solved my problem instantly.
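For context: a C program starts in the "C" locale and only picks up the environment's locale (LANG, LC_ALL and so on) once setlocale(LC_ALL, "") is called. Here is a minimal sketch of where that call would go relative to the wide-character reads. The function name count_wide_lines is hypothetical, and the sketch assumes the environment names a locale whose encoding matches the file.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Counts how many times fgetws() succeeds on path (one per line, as long as
   lines are shorter than the buffer). Returns -1 if the file cannot be opened. */
long count_wide_lines(const char *path)
{
    /* Adopt the locale named by the environment before any wide I/O. */
    if (setlocale(LC_ALL, "") == NULL)
        fprintf(stderr, "warning: environment locale is not available\n");

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    wchar_t line[100];
    long count = 0;
    while (fgetws(line, 100, fp) != NULL)
        count++;

    /* fgetws() also returns NULL on a conversion error, not only at end of
       file, so distinguish the two. */
    if (ferror(fp))
        perror("fgetws");
    fclose(fp);
    return count;
}
```

Checking ferror() after the loop would have shown that the loop in the question stopped because of an error rather than because the file ended.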

Related

Multiple (randomly chosen) outputs across different launches of the same program. Random characters added when fscanf'ing

Simple program: reads a name and a surname (John Smith) from a .txt file via fscanf, adds spaces, prints the name in the console (just as it's written in the .txt).
If compiled and ran on Win10 via
Microsoft (R) C/C++ Optimizing Compiler Version 19.14.26433 for x86
the following code does not produce the same output for the same input across different .exe launches (no recompiling). For each input it seems to have multiple outputs available, between which the program decides at random.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
    char input_file_name[255];
    FILE * input_file;
    char name[255];
    input_file = fopen ("a.txt","r");
    do
    {
        if (strlen(name) != 0 )
            name[strlen(name)] = ' ';
        fscanf (input_file, "%s", name + strlen(name) * sizeof(char));
    }while(!feof(input_file));
    fclose (input_file);
    printf("Name:%s\n", name);
    system("pause");
    return 0;
}
I will list a couple of inputs and outputs for them. As not all characters are printable, I will type them as \ascii_code instead, such as \97 = a.
The most common anomalies are \31 (Unit Separator) added at the very front of the string and \12 (NP form feed, new page) or \17 (device control 1) right before the surname (after the first space).
For "John Smith":
"John Smith" (proper output)
"\31 John Smith"
For "Atoroco Coco"
"Atoroco \12Coco"
"\31 Atoroco \16Coco"
For "Mickey Mouse"
"Mickey Mouse" (proper)
"\31 Mickey\81Mouse" (There is a \32 (space) in the string right before the \81, but the console doesn't show the space?!)
If compiled on a different machine (MacOS, compiler unknown) it seems to work properly each time, that is, it simply prints the .txt's contents.
Why are there multiple outputs produced, seemingly at random?
Why are these characters (\31, \12 etc) in particular added, and no other?
Your code invokes Undefined Behavior (UB), since it uses name uninitialized. Read more in What is Undefined Behaviour in C?
We will initialize it, and make sure the null terminator is there. Standard string functions, like strlen(), depend on the null terminator to mark the end of the string.
Then, you need to make sure that you read something before you call feof(). Moreover, it's a good idea to check what fscanf() returns, which denotes the number of items read.
Putting all together, we get:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
    char input_file_name[255];
    FILE * input_file;
    char name[255] = "\0"; // initialize it so that it has a null terminator
    input_file = fopen ("a.txt","r");
    do
    {
        if (strlen(name) != 0 )
            name[strlen(name)] = ' ';
    } while (fscanf (input_file, "%s ", name + strlen(name) * sizeof(char)) == 1 && !feof(input_file));
    fclose (input_file);
    printf("Name:%s\n", name);
    return 0;
}
Output (for "georgios samaras"):
georgios samaras
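As a variation on the same idea, here is a sketch that drives the loop purely by fscanf()'s return value and adds a field width so that a long word cannot overrun the buffer. The helper name read_words and the 64-byte word limit are assumptions made for the example, not part of the original code.

```c
#include <stdio.h>
#include <string.h>

/* Joins the whitespace-separated words of in into out, separated by single
   spaces. Returns the number of words stored. */
int read_words(FILE *in, char *out, size_t outsz)
{
    char word[64];          /* hypothetical per-word limit */
    int count = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';          /* start from a valid empty string */

    /* %63s leaves room for the terminator in word[64]; the loop ends as soon
       as fscanf() fails to convert a word, so feof() is not needed. */
    while (fscanf(in, "%63s", word) == 1) {
        size_t used = strlen(out);
        size_t need = strlen(word) + (count > 0 ? 1 : 0);
        if (used + need + 1 > outsz)
            break;          /* not enough space left in out */
        if (count > 0)
            strcat(out, " ");
        strcat(out, word);
        count++;
    }
    return count;
}
```

Called with a file containing "John Smith" and a 255-byte buffer, this should leave "John Smith" in the buffer and return 2.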

Different execution flow using read() and fgets() in C

I have a sample program that takes in an input from the terminal and executes it in a cloned child in a subshell.
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/wait.h>
#include <sched.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
int clone_function(void *arg) {
    execl("/bin/sh", "sh", "-c", (char *)arg, (char *)NULL);
}
int main() {
    while (1) {
        char data[512] = {'\0'};
        int n = read(0, data, sizeof(data));
        // fgets(data, 512, stdin);
        // int n = strlen(data);
        if ((strcmp(data, "exit\n") != 0) && n > 1) {
            char *line;
            char *lines = strdup(data);
            while ((line = strsep(&lines, "\n")) != NULL && strcmp(line, "") != 0) {
                void *clone_process_stack = malloc(8192);
                void *stack_top = clone_process_stack + 8192;
                int clone_flags = CLONE_VFORK | CLONE_FS;
                clone(clone_function, stack_top, clone_flags | SIGCHLD, (void *)line);
                int status;
                wait(&status);
                free(clone_process_stack);
            }
        } else {
            exit(0);
        }
    }
    return 0;
}
The above code works on an older Linux system (with minimal RAM) but not on a newer one. By "doesn't work" I mean that if I type a simple command like "ls" I don't see the output on the console. On the older system I do see it.
Also, if I run the same code on gdb in debugger mode then I see the output printed onto the console in the newer system as well.
In addition, if I use fgets() instead of read() it works as expected in both systems without an issue.
I have been trying to understand the behavior and I couldn't figure it out. I tried doing an strace. The difference I see is that the wait() return has the output of the ls program in the cases it works and nothing for the cases it does not work.
The only thing I can think of is that read(), since it's not a library function, has undefined behavior across systems. But I can't see how that would affect the output.
Can someone point out why I might be observing this behavior?
EDIT
The code is compiled as:
gcc test.c -o test
strace when it's not working as expected is shown below
strace when it's working as expected (only difference is I added a printf("%d\n", n); following the call for read())
Thank you
Shabir
There are multiple problems in your code:
a successful read system call can return any non zero number between 1 and the buffer size depending on the type of handle and available input. It does not stop at newlines like fgets(), so you might get line fragments, multiple lines, or multiple lines and a line fragment.
furthermore, if read fills the whole buffer, as it might when reading from a regular file, there is no trailing null terminator, so passing the buffer to string functions has undefined behavior.
the test if ((strcmp(data, "exit\n") != 0) && n > 1) { is performed in the wrong order: first test if read was successful, and only then test the buffer contents.
you do not set the null terminator after the last byte read by read, relying on buffer initialization, which is wasteful and insufficient if read fills the whole buffer. Instead you should make data one byte longer than the read size argument, and set data[n] = '\0'; if n > 0.
Here are ways to fix the code:
using fgets(), you can remove the line splitting code: just remove initial and trailing white space, ignore empty and comment lines, clone and execute the commands.
using read(), you could just read one byte at a time, collect these into the buffer until you have a complete line, null terminate the buffer and use the same rudimentary parser as above. This approach mimics fgets(), by-passing the buffering performed by the standard streams: it is quite inefficient but avoids reading from handle 0 past the end of the line, thus leaving pending input available for the child process to read.
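The second approach (reading one byte at a time with read() until a full line has been collected) might look something like the sketch below. The function name read_line_fd is made up for illustration, and the sketch does not retry on EINTR, which real code may need to do.

```c
#include <stddef.h>
#include <unistd.h>

/* Reads one line from fd into buf (at most size - 1 bytes), one byte at a
   time, and null-terminates it. The newline is consumed but not stored.
   Returns the line length, or -1 on a read error or on end of file with
   nothing read. */
ssize_t read_line_fd(int fd, char *buf, size_t size)
{
    size_t len = 0;
    char c;

    if (size == 0)
        return -1;
    while (len + 1 < size) {
        ssize_t n = read(fd, &c, 1);
        if (n < 0)
            return -1;            /* read error */
        if (n == 0) {             /* end of file */
            if (len == 0)
                return -1;
            break;
        }
        if (c == '\n')
            break;                /* line complete; later input stays unread */
        buf[len++] = c;
    }
    buf[len] = '\0';              /* always terminate before using as a string */
    return (ssize_t)len;
}
```

Because it never reads past the newline, any input that follows the current line stays in the descriptor and remains available to the child process.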
It looks like 8192 is simply too small a value for stack size on a modern system. execl needs more than that, so you are hitting a stack overflow. Increase the value to 32768 or so and everything should start working again.

fprintf(fp, "%c",10) not behaving as expected

Here's my code:
#include <stdio.h>
#include <stdlib.h>
main(){
    FILE* fp = fopen("img.ppm","w");
    fprintf(fp,"%c", 10);
    fclose(fp);
    return 0;
}
For some reason that I am unable to uncover, this writes 2 bytes to the file: "0x0D 0x0A", while the behaviour I would expect is for it to write just "0x0A", which is 10 in decimal. It seems to work fine with every other value between 0 and 255 inclusive: it writes just one byte to the file. I am completely lost, any help?
Assuming you are using the Windows C runtime library, newline characters are written as \r\n, or 13 10. Which is 0x0D 0x0A. This is the only character that's actually written as two characters (by software compiled using the Windows toolchain).
You need to open the file with fopen("img.ppm","wb") to write binary.
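To check this on your own machine, a small sketch like the following writes one byte in binary mode and then counts the bytes in the file. The helper name write_one_byte is hypothetical; with "wb" the count should be 1 regardless of platform.

```c
#include <stdio.h>

/* Writes a single byte to path in binary mode and returns the resulting
   file size in bytes, or -1 on failure. */
long write_one_byte(const char *path, unsigned char value)
{
    FILE *fp = fopen(path, "wb");   /* "b": no newline translation */
    if (fp == NULL)
        return -1;
    if (fputc(value, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    /* Re-open in binary mode and count the bytes that were stored. */
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long size = 0;
    while (fgetc(fp) != EOF)
        size++;
    fclose(fp);
    return size;
}
```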

Failing to print characters from file

I am trying to read from a file and for some reason sometimes it works and sometimes I get the most bizarre results ever.
The code:
#include <stdio.h>
#include <string.h>
int main(int argc, char* argv[])
{
    FILE *f = fopen("mac_input_off.txt","r");
    char c[2] = "";
    while ( 0 != fread(c,sizeof(char),1,f) )
    {
        c[1] = '\0';
        printf("%s",c);
    }
    fclose(f);
}
On Windows, with Visual Studio 2013, it works just fine, but on Ubuntu Linux, running under VMware, for some reason it refuses to read the text and reads only the carriage return at the end of the text (encoded with Mac OS newlines).
This is the text in the file: bbb58bc7a385cf89ee2102d5ea8d7cab.
A possible reason is that the 8th bit in every byte is set to 0 in this text.
Any idea what I am not getting?
EDIT: The funny semi-colon terminating the while loop was removed and yet nothing is fixed... back to the drawing board.
Tried to check if it actually reads things by putting a breakpoint after the 10th line in gdb (my actual knowledge of gdb is meager and I can't seem to get any front-end working) and it does in fact read the characters. It just doesn't want to print them.
Note that in the line:
while ( 0 != fread(c,sizeof(char),1,f) );
there is no loop body because the statement is terminated with a semi-colon.
Remove the ; and the statements enclosed in the following {...} will be executed as the body of the loop. You should then see the contents of the file displayed.
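For comparison, here is a sketch of the same byte-by-byte loop written so that the braces really are the loop body. The function name copy_bytes is made up for the example.

```c
#include <stdio.h>

/* Copies in to out one byte at a time and returns the number of bytes copied. */
long copy_bytes(FILE *in, FILE *out)
{
    char c;
    long copied = 0;

    /* No semicolon after the condition: the braces below are the loop body. */
    while (fread(&c, sizeof c, 1, in) == 1) {
        fputc(c, out);
        copied++;
    }
    return copied;
}
```

Calling copy_bytes(f, stdout) on the opened file would print its contents, much as the original loop intended.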

Using chmod in a C program

I have a program where I need to set the permissions of a file (say /home/hello.t) using chmod and I have to read the permissions to be set from a file. For this I first read the permissions into a character array and then try to modify the permissions of the file. But I see that permissions are set in a weird manner.
A sample program I have written:
main()
{
    char mode[4]="0777";
    char buf[100]="/home/hello.t";
    int i;
    i = atoi(mode);
    if (chmod (buf,i) < 0)
        printf("error in chmod");
}
I see that the permissions of the file are not set to 777. Can you please help me out on how to set the permissions of the file after reading the same from a character array.
The atoi() function only translates decimal, not octal.
For octal conversion, use strtol() (or, as Chris Jester-Young points out, strtoul() - though the valid sizes of file permission modes for Unix all fit within 16 bits, and so will never produce a negative long anyway) with either 0 or 8 as the base. Actually, in this context, specifying 8 is best. It allows people to write 777 and get the correct octal value. With a base of 0 specified, the string 777 is decimal (again).
Additionally:
Do not use 'implicit int' return type for main(); be explicit as required by C99 and use int main(void) or int main(int argc, char **argv).
Do not play with chopping trailing nulls off your string.
char mode[4] = "0777";
This prevents C from storing a terminal null - bad! Use:
char mode[] = "0777";
This allocates the 5 bytes needed to store the string with a null terminator.
Report errors on stderr, not stdout.
Report errors with a newline at the end.
It is good practice to include the program name and file name in the error message, and also (as CJY pointed out) to include the system error number and the corresponding string in the output. That requires the <string.h> header (for strerror()) and <errno.h> for errno. Additionally, the exit status of the program should indicate failure when the chmod() operation fails.
Putting all the changes together yields:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>
int main(int argc, char **argv)
{
    char mode[] = "0777";
    char buf[100] = "/home/hello.t";
    int i;
    i = strtol(mode, 0, 8);
    if (chmod (buf,i) < 0)
    {
        fprintf(stderr, "%s: error in chmod(%s, %s) - %d (%s)\n",
                argv[0], buf, mode, errno, strerror(errno));
        exit(1);
    }
    return(0);
}
Be careful with errno; it can change when functions are called. It is safe enough here, but in many scenarios, it is a good idea to capture errno into a local variable and use the local variable in printing operations, etc.
Note too that the code does no error checking on the result of strtol(). In this context, it is safe enough; if the user supplied the value, it would be a bad idea to trust them to get it right.
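If the mode string does come from a user or a file, the conversion could be checked along these lines. The function name parse_octal_mode and the upper bound of 07777 (permission bits plus setuid, setgid and sticky) are choices made for this sketch.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>

/* Parses text as an octal permission string such as "777" or "0777".
   Returns 0 and stores the mode on success, -1 on malformed input. */
int parse_octal_mode(const char *text, mode_t *mode)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 8);
    if (end == text || *end != '\0')
        return -1;                 /* no digits, or trailing junk such as '8' */
    if (errno == ERANGE || value < 0 || value > 07777)
        return -1;                 /* outside the permission bit range */
    *mode = (mode_t)value;
    return 0;
}
```

With base 8, both "777" and "0777" yield the octal value 0777 (511 decimal), whereas atoi("0777") yields decimal 777, which is why the original program set unexpected permissions.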
One last comment: generally, you should not use 777 permission on files (or directories). For files, it means that you don't mind who gets to modify your executable program, or how. This is usually not the case; you do care (or should care) who modifies your programs. Generally, don't make data files executable at all; when files are executable, do not give public write access and look askance at group write access. For directories, public write permission means you do not mind who removes any of the files in the directory (or adds files). Again, occasionally, this may be the correct permission setting to use, but it is very seldom correct. (For directories, it is usually a good idea to use the 'sticky bit' too: 1777 permission is what is typically used on /tmp, for example - but not on MacOS X.)
