I'm attempting to read from a Project Gutenberg text file and count the total number of words. I'm currently overshooting because words with apostrophes are double counted. However the apostrophe character from the text file doesn't match the ASCII character 39, i.e. '\'', so my is_word function is working incorrectly. Any suggestion as to what that character actually is?
Note: When I go through and manually replace the apostrophes in vim, the word counter works fine.
link to text file: http://www.gutenberg.org/ebooks/1342
This isn't a complete answer, but if you do
#include <wchar.h>
#include <locale.h>
and then
setlocale(LC_ALL, "en_US.UTF-8");
and then call getwchar() or getwc(fp) instead of getchar/getc, and then check for the value 8217 (U+2019, RIGHT SINGLE QUOTATION MARK, the typographic apostrophe) as well as '\'', you might be able to get it all to work.
(It works for me. YMMV. Depending on your OS, you might have to use a locale string other than "en_US.UTF-8".)
(And if this does work, welcome to the wonderful world of internationalization. Having gone down this road, there are several other issues you'll have to pay attention to if you want your code to work properly under all circumstances and in all locales.)
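In UTF-8, the character with value 8217 (U+2019) is stored as the three bytes 0xE2 0x80 0x99. As a minimal sketch of an alternative that avoids locales altogether (a byte-level approach, not the wide-character one described above), a counter can treat that byte sequence like an ASCII apostrophe. The name count_words_utf8 and the rule that a word is a run of ASCII letters and apostrophes are assumptions made for illustration:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts words in a NUL-terminated UTF-8 string.
 * A "word" here is a run of ASCII letters and apostrophes, where the
 * typographic apostrophe U+2019 (bytes E2 80 99) counts as an apostrophe. */
size_t count_words_utf8(const char *text)
{
    assert(text != NULL);
    const unsigned char *p = (const unsigned char *)text;
    size_t words = 0;
    int in_word = 0;

    while (*p != '\0') {
        size_t step = 1;
        int word_char;

        if (p[0] == 0xE2 && p[1] == 0x80 && p[2] == 0x99) {
            word_char = in_word;   /* U+2019 continues a word, never starts one */
            step = 3;
        } else if (*p == '\'') {
            word_char = in_word;   /* same rule for the ASCII apostrophe */
        } else {
            word_char = isalpha(*p) != 0;
        }

        if (word_char && !in_word)
            words++;               /* a new word begins here */
        in_word = word_char;
        p += step;
    }
    return words;
}
```

With this rule a string such as "don’t stop" counts as two words rather than three. Reading a whole file would still need a loop that feeds lines or a buffer to the function.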
I'm writing a program, and at some point the source file's encoding changed to UTF-8, which created a problem for me. I'm Brazilian and I have some strings in Portuguese with special characters that are in the ASCII table, but revising every printf and every phrase or word to check for a special character is madness in a 700-line program. I'm short on time, so I tried changing the encoding to ISO-8859-1, UNICODE and WINDOWS-1252, but whenever I build or save the file it reverts to UTF-8. I also tried calling setlocale(LC_ALL,"pt_BR.utf8") and other variants, but nothing happens. I thought the Code::Blocks terminal was broken, so I made a new test file to see whether the WINDOWS-1252 encoding worked. Does anyone have an idea that could help, or will I have to fix it character by character?
I'm using the default Code::Blocks terminal, cb_console_runner.
Isn't the problem that UTF-8's bytes are incompatible with special characters? I thought the default in Unicode was 16 bytes, or am I wrong?
EDIT:
#include <stdio.h>
#include <locale.h>
int main(){
    printf("%s", setlocale(LC_ALL,"pt_BR.utf8"));
}
returned: (NULL)
#include <stdio.h>
#include <locale.h>
int main(){
    printf("%s", setlocale(LC_ALL,""));
}
returned: Portuguese_Brazil.1252
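setlocale returns a null pointer when the requested locale is not available, and passing a null pointer to printf's %s is undefined behavior, so it is safer to test the result before printing it. A minimal sketch; the helper name locale_or_fallback and the fallback text are made up for illustration:

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: tries to select `name` for all locale categories.
 * Returns the locale string reported by setlocale, or `fallback` if the
 * C library does not provide that locale. */
const char *locale_or_fallback(const char *name, const char *fallback)
{
    assert(name != NULL && fallback != NULL);
    const char *selected = setlocale(LC_ALL, name);
    return selected != NULL ? selected : fallback;
}
```

It could be used as printf("%s\n", locale_or_fallback("pt_BR.utf8", "(locale not available)")); which locale names are accepted depends on the C library and the operating system.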
I looked through previous questions on the Portuguese Stack Overflow, but none of them helped at all: some say it is the encoding, others say it is the terminal.
So, yesterday I talked to my professor and we both agreed that it was the encoding, but he had an idea: I opened my code in Dev-C++, and we noticed that the file was "corrupted" at the special letters I mentioned. I think that happened when the file was saved as UTF-8.
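To illustrate why the same source text can look corrupted when it is opened under a different encoding: in ISO-8859-1 (and, for the Portuguese letters, Windows-1252) a character such as é is the single byte 0xE9, while UTF-8 stores it as the two bytes 0xC3 0xA9. UTF-8 is a variable-width encoding that uses 1 to 4 bytes per character. The sketch below converts ISO-8859-1 bytes to UTF-8; the function name latin1_to_utf8 is an assumption for illustration, and it does not handle the extra Windows-1252 characters in the range 0x80 to 0x9F.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts a NUL-terminated ISO-8859-1 string to UTF-8.
 * Writes at most out_size bytes (including the terminating NUL) to `out`.
 * Returns the number of bytes written excluding the NUL, or (size_t)-1 if
 * the output buffer is too small. */
size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);
    const unsigned char *src = (const unsigned char *)in;
    size_t used = 0;

    for (; *src != '\0'; src++) {
        if (*src < 0x80) {
            /* ASCII is identical in both encodings: one byte. */
            if (used + 1 >= out_size) return (size_t)-1;
            out[used++] = (char)*src;
        } else {
            /* U+0080..U+00FF become two bytes: 110000xx 10xxxxxx. */
            if (used + 2 >= out_size) return (size_t)-1;
            out[used++] = (char)(0xC0 | (*src >> 6));
            out[used++] = (char)(0x80 | (*src & 0x3F));
        }
    }
    if (used >= out_size) return (size_t)-1;
    out[used] = '\0';
    return used;
}
```

A console that expects one encoding but receives the bytes of the other shows exactly this kind of mismatch: one accented letter turns into two unrelated characters, or the other way round.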
I'm new to C and I came across this code and it was confusing me:
sprintf(banner1, "\e[37╔═╗\e[37┌─┐\e[37┌┐┌\e[37┌─┐\e[37┌─┐\e[37┌─┐\e[37┌─┐\e[37m\r\n");
sprintf(banner2, "\e[37╠═╝\e[37├─┤\e[37│││\e[37│ ┬\e[37├─┤\e[37├┤\e[37 ├─┤\e[37m\r\n");
sprintf(banner3, "\e[37╩ \e[37┴ ┴┘\e[37└┘\e[37└─┘\e[37┴ ┴\e[37└─┘\e[37┴ ┴\e[37m\r\n");
I was just confused, as I don't know what \e[37 and \r\n mean. And can I change the colors?
This looks like an attempt to use ANSI terminal color escapes and Unicode box drawing characters to write the word "PANGAEA" in a large, stylized, colorful manner. I'm guessing it's part of a retro-style BBS or MUD system, intended to be interacted with over telnet or ssh. It doesn't work, because whoever wrote it made a bunch of mistakes. Here's a corrected, self-contained program:
#include <stdio.h>
int main(void)
{
    printf("\e[31m╔═╗\e[32m┌─┐ \e[33m┌┐┌\e[34m┌─┐\e[35m┌─┐\e[36m┌─┐\e[37m┌─┐\e[0m\n");
    printf("\e[31m╠═╝\e[32m├─┤ \e[33m│││\e[34m│ ┬\e[35m├─┤\e[36m├┤ \e[37m├─┤\e[0m\n");
    printf("\e[31m╩  \e[32m┴ ┴ \e[33m┘└┘\e[34m└─┘\e[35m┴ ┴\e[36m└─┘\e[37m┴ ┴\e[0m\n");
    return 0;
}
The mistakes were: using \r\n instead of plain \n, leaving out the m at the end of each and every escape sequence, and a number of typos in the actual letters (missing spaces and the like).
I deliberately changed sprintf(bannerN, ... to printf to make it a self-contained program instead of a fragment of a larger system, and changed the actual color codes used for each letter to make it a more interesting demo. When I run this program on my computer, the word "PANGAEA" is drawn three rows high in box-drawing characters, with each letter in a different color.
The program will only work on your computer if your terminal emulator supports both ANSI color escapes and printing UTF-8 with no special ceremony. Most Unix-style operating systems nowadays support both by default; I don't know about Windows.
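On the question of changing colors: the number between \e[ and m is an SGR (Select Graphic Rendition) code. Codes 30 to 37 select the eight standard foreground colors (31 is red, 32 is green, 37 is white) and 0 resets all attributes. Note that \e is a compiler extension; \033 or \x1b is the portable spelling of the escape character. A minimal sketch, with a made-up helper name colorize:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: wraps `text` in an ANSI SGR color sequence followed
 * by a reset, e.g. sgr_code 31 for red. Returns the value from snprintf:
 * the length the full string needs, which is >= size if it was truncated. */
int colorize(char *buf, size_t size, int sgr_code, const char *text)
{
    assert(buf != NULL && text != NULL);
    return snprintf(buf, size, "\033[%dm%s\033[0m", sgr_code, text);
}
```

Calling colorize(buf, sizeof buf, 32, "ok") and printing buf should show the text in green on a terminal that supports ANSI escapes; on one that does not, the raw escape characters appear instead.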
I am doing a basic C tutorial. In an example this code was given to introduce escape sequences:
#include <stdio.h>
int main()
{
    printf("This is a \"sample text\"\n");
    printf("\tMore text\n");
    printf("This is getting overwritten\r");
    printf("By this, another sample text\n");
    printf("The spa \bce is removed.\n");
    return 0;
}
The console output is expected to look like this:
This is a "sample text"
More text
By this, another sample text
The space is removed.
Instead, I get this:
This is a "sample text"
More text
This is getting overwritten
By this, another sample text
The spa ce is removed.
I am using Eclipse Cpp Oxygen on Windows and the Cygwin toolchain to compile and run the code. I don't know what I'm doing wrong and I thought I'd ask here for help.
The console built in to Eclipse does not support the \r, \b (and \f) characters.
There is a long-standing bug, 76936, for this, which has been open for 14 years and doesn't look like being fixed.
On Linux your example works exactly as you expect. On Windows the \r is probably being treated like \n.
In a Linux terminal, by contrast, \r (correctly) moves the cursor to the first character of the row.
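What a terminal that honors these control characters does can be sketched as a small simulation: \r moves the cursor to the start of the line and \b moves it one cell to the left, and in both cases later characters overwrite what was already there. The function name render_line is hypothetical, and real terminals handle many more cases (tabs, line wrapping, escape sequences):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: simulates how a terminal that honors \r and \b
 * would display one line of output. `out` must have room for `out_size`
 * bytes; characters that would not fit are dropped. */
void render_line(const char *text, char *out, size_t out_size)
{
    assert(text != NULL && out != NULL && out_size > 0);
    size_t cursor = 0;  /* where the next character will be written */
    size_t length = 0;  /* number of cells written so far */

    for (; *text != '\0'; text++) {
        if (*text == '\r') {
            cursor = 0;                 /* back to the first column */
        } else if (*text == '\b') {
            if (cursor > 0) cursor--;   /* one cell left, nothing erased */
        } else if (cursor + 1 < out_size) {
            out[cursor++] = *text;      /* overwrite the cell under the cursor */
            if (cursor > length) length = cursor;
        }
    }
    out[length] = '\0';
}
```

Feeding it "The spa \bce is removed." yields "The space is removed.", which is the output the tutorial expects. A console that ignores \b and \r simply prints the text in order instead.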
I was reading the GCC online docs, and I am confused about what they say regarding a \ or / in a header file name.
It says:
However, if backslashes occur within file, they are considered
ordinary text characters, not escape characters. None of the character
escape sequences appropriate to string constants in C are processed.
Thus, #include "x\n\\y" specifies a filename containing three
backslashes. (Some systems interpret ‘\’ as a pathname separator. All
of these also interpret ‘/’ the same way. It is most portable to use
only ‘/’.)
What does it mean by "some systems" in this paragraph? Does it mean the implementation depends upon the OS - Windows/Linux? (I know in #include <linux/module.h>, / specifies a path)
On Windows, both / and \ function as pathname component separators (dividing the name of a directory from the name of something within the directory). On basically all other operating systems in common use today, only / serves this function.[1]
By taking \ literally in header-file names, instead of an escape character as it is in normal strings, GCC's preprocessor accommodates Windows-specific code written with backslashes, e.g.
#include <sys\types.h>
where it is exceedingly unlikely that the programmer intended to include a file whose name is 'sys ypes.h' (that blank before the 'y' is a hard tab). The text in the manual is intended to inform you that such code will not work if moved to a Unixy system, but that if you write it with a forward slash instead, it will work on Windows.
I happen to have written the paragraph you quote, but that was more than ten years ago now, and I don't remember why I didn't use the word "Windows".
[1] VMS and some IBM mainframe OSes have entirely different pathname syntax, but these have never been well-supported by GCC, and it is my understanding that surviving installations tend to have a POSIX compatibility layer installed anyway.
Remember that in ordinary strings, \n is a newline, \t is a tab, \a is an alert, etc.
The text means that if you write:
#include <sys\alert.h>
the \a sequence is treated as two characters, backslash and 'a', and not as a single character 'alert'. That is, the file is called alert.h and is found in a directory sys somewhere in one of the directories that the compiler searches for headers. Normally, inside a string, "sys\alert.h" would mean a name 's', 'y', 's', alert, 'l', 'e', 'r', 't', '.', 'h'.
Similarly for:
#include <sys\backtrack.h>
#include <sys\file.h>
#include <sys\newton.h>
#include <sys\register.h>
#include <sys\time.h>
#include <sys\vtable.h>
(where I made up names as seemed to be convenient, and the sys can be replaced by any other directory name, and the <> by "").
#include <sys\386.h>
#include <sys\xdead.h>
#include <sys\ucafe.h>
#include <sys\U00DEFACED.h>
are also treated as regular strings rather than containing octal, hexadecimal, or Unicode escapes.
Windows is the main system where \ is used as the formal path element separator. However, even on Windows, the API treats / as if it were \.
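The contrast with ordinary string literals can be checked directly. In the sketch below, the literal "sys\types.h" contains a tab where \t was written, so it is one character shorter than the spelling with a doubled backslash. The variable and function names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* In a string literal, \t is processed as an escape: this is
 * 's','y','s', TAB, 'y','p','e','s','.','h' (10 characters). */
const char with_escape[] = "sys\types.h";

/* To get a real backslash in a string literal it must be doubled:
 * 's','y','s','\\','t','y','p','e','s','.','h' (11 characters). */
const char with_backslash[] = "sys\\types.h";

/* Hypothetical helper: reports whether a name contains a tab character. */
int contains_tab(const char *name)
{
    assert(name != NULL);
    return strchr(name, '\t') != NULL;
}
```

Inside #include <...> or #include "..." no such processing happens, which is why a single backslash there names the path the Windows programmer intended.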
I'm programming on Windows, but in my C console some characters (like é, à, ã) are not displayed correctly. I would like to know how I can make Windows interpret those characters as Unicode or UTF-8 in the console.
I would be glad for some enlightenment.
Thank you very much
By console do you mean cmd.exe? It doesn't handle Unicode well, but you can get it to display "ANSI" characters by changing the display font to Lucida Console and changing the code page from "OEM" to "ANSI." By the choice of characters you seem to be Western European, so try giving this command before running your application:
chcp 1252
If you want to try your luck with UTF-8 output use chcp 65001 instead.
Although I completely agree with Joni's answer, I think one detail can be added:
Since Telmo Vaz asked about how to solve this problem for C programs, we can consider the alternative of adding a system command inside the code:
#include <stdlib.h> // To use the function system();
#include <stdio.h>
int main(void) {
    system("CHCP 1252");
    printf("Now accents are right: áéíüñÇ \n");
    return 0;
}
EDIT: It is a good idea to do some experiments with code pages. Check the following table for information (under Windows):
Windows Codepages