I have a homework problem. I have to validate the entry of uppercase characters; the plain A to Z part is not the issue.
I just put a while (c < 65 || c > 90) and it works fine. But in my country we use Ñ too, and that is my problem. I tried to use the ASCII code 165 to validate the entry, but it didn't work.
The char range is from -128 to 127, so for the extended ASCII table I need an unsigned char, right?
I tried this:
#include <stdio.h>

int main(){
    unsigned char n;
    scanf("%c", &n);     /* read one character from the console */
    printf("%d", n);     /* print its numeric code */
    return 0;
}
It prints 165 if it scans an 'Ñ'.
The next one:
unsigned char n;
n='Ñ';
printf("%d",n);
Prints 209.
So I try to validate with 165 and 209 and neither works.
Why does this happen? What can I do to validate the entry of this character?
It works when I use unsigned char and validate with 165. But when I used cmd to try it by reading a txt file, it didn't work...
It prints 165 if it scans an 'Ñ'.
This means that on your system the character 'Ñ' read from the console has the code 165, as in the OEM code pages (CP437/CP850) traditionally used by the Windows console.
printf("%d",'Ñ');
prints 209.
This reveals a different encoding for the characters you type in your IDE: 209 is the code of 'Ñ' in Windows-1252 and in ISO 8859-1.
Mark Tolonen has suggested that the 165 corresponds to OEM cp437.
(I originally associated it with UTF-8, but I'm a little confused now...)
In C you have to take into account the existence of two character sets, which can be different:
The source character set.
The execution character set.
The source character set refers to the encoding used by your editing environment, that is, the place where you normally type your .c files. Your system and/or editor and/or IDE works with a specific encoding scheme. In this case, it seems that the encoding is Windows-1252 (the ANSI code page, essentially Latin-1).
Thus, if you write 'Ñ' in your editor, the character Ñ gets the encoding of your editor, not the encoding of the console. In this case you have Ñ encoded as 209, which gives you 'Ñ' == 209 as true.
The execution character set refers to the encoding used by the operating system and/or the console in which you run your executable (that is, compiled) programs. It seems that this encoding is an OEM code page (CP437 or CP850).
In particular, when you type Ñ in the console of your system, it is encoded as 165, which is why you get the value 165 when you print it.
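If you want to see both values on your own machine, here is a small diagnostic sketch (only a sketch: what it prints depends entirely on your editor's encoding and on your console's code page, which is precisely the point):

#include <stdio.h>

int main(void)
{
    /* The value of the literal depends on how this .c file is saved
       (it may even be a multi-byte constant if the file is UTF-8). */
    printf("the literal compiles to %d\n", (unsigned char)'Ñ');

    /* The value read here depends on the console's code page. */
    printf("type N-tilde and press Enter: ");
    int c = getchar();
    printf("the console sent %d\n", c);
    return 0;
}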
Since this mismatch can always happen, you must be aware of it and make some adjustments to avoid potential problems.
It works when I use unsigned char and validate with 165. But when I used cmd to try it by reading a txt file, it didn't work...
This means that your .txt file has been written with a text editor (perhaps your own IDE, I guess) that uses an encoding different from the one your console uses.
Let me guess: you are writing your C code and your text files with the same IDE, but you are executing the programs from the Windows CMD.
There are two possible solutions here.
The complicated solution is to investigate encoding schemes, locale issues, and wide characters. There are no quick solutions here, because you need to be careful about several delicate details.
The easy solution is to make adjustments in all the tools you are using.
Go to the options of your IDE and try to find out which encoding scheme it uses to save text files (I guessed above that it is Windows-1252/ANSI, but you may find other possibilities there, like Latin-1 (ISO-8859-1), UTF-8, UTF-16 and so on).
Execute the command CHCP in your CMD to obtain the code page number your console is using. This code page is a number whose meaning is explained by Microsoft here:
a. OEM codepages
b. Windows codepages
c. ISO codepages
d. LIST OF ALL WINDOWS CODEPAGES
I guess you will see code page 437 or 850, the OEM code pages typically used by the Windows console.
Change one of these configurations so that it matches the other.
a. In the configuration of your IDE, in the "editor options" part, you could change the encoding so that it matches the code page reported by CHCP.
b. Or, better, change the code page in your CMD by means of the CHCP command so that it matches your editor's encoding; for Windows-1252 that would be:
CHCP 1252
The solution involving a change of code page in CMD probably will not always work as expected.
It is safer to go with solution (a.): modify the configuration of your editor.
However, it would be preferable to keep UTF-8 in your editor (if that is your editor's choice), because nowadays modern software is moving towards the UTF encodings (Unicode).
New info: The UTF-8 encoding sometimes uses more than 1 byte to represent 1 character. The following table shows the UTF-8 encoding for the first 256 code points:
UTF-8 for U+0000 to U+00FF
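To see the multi-byte nature concretely, you can dump the bytes of the UTF-8 encoding of Ñ. The string below is written with explicit \x escapes so the result does not depend on how the source file itself is saved (0xC3 0x91 is the standard UTF-8 encoding of U+00D1):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char ntilde_utf8[] = "\xC3\x91";   /* UTF-8 bytes for U+00D1 (Ñ) */

    printf("UTF-8 N-tilde takes %zu bytes:", strlen((const char *)ntilde_utf8));
    for (size_t i = 0; ntilde_utf8[i] != '\0'; i++)
        printf(" 0x%02X", ntilde_utf8[i]);
    printf("\n");   /* prints: 2 bytes: 0xC3 0x91 */
    return 0;
}

This is also why a single unsigned char comparison can never match Ñ if the input stream really is UTF-8.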
Note: After a little discussion in the comments, I realized that I had some wrong beliefs about UTF-8 encoding. At least this illustrates my point: encoding is not a trivial matter.
So I have to repeat my advice to the OP here: take the simplest path and try to reach an agreement with your teacher about how to handle the encoding of special characters.
165 is not an ASCII code. ASCII goes from 0 to 127. 165 is a code in some other character set. In any case, char must be used for scanf and you can convert the value to unsigned char after that. Alternatively, use getchar() which returns a value in the range of unsigned char already.
You should use the standard function isalpha from ctype.h:
int n = getchar();
if ( isalpha(n) )
{
// do something...
}
You will probably also have to set a locale in which this character is a letter, e.g. setlocale(LC_CTYPE, "es_ES");
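Putting the two suggestions together, a minimal sketch might look like the following. The locale name "es_ES" is an assumption (on Windows you may need "Spanish" or "es_ES.1252"), isupper is used instead of isalpha because the assignment is about uppercase letters, and whether Ñ is classified correctly still depends on the code page your console actually delivers:

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_CTYPE, "es_ES");       /* assumed locale name; adjust for your system */

    int c = getchar();                  /* already in the range of unsigned char (or EOF) */
    if (c != EOF && isupper(c))
        printf("'%c' (code %d) is an uppercase letter in this locale\n", c, c);
    else
        printf("code %d is not an uppercase letter here\n", c);
    return 0;
}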
Related
I know there are a few similar questions around relating to this, but it's still not completely clear.
For example: if in my C source file I have lots of string literals, does the compiler, as it translates this source file, go through each character of the strings and use a look-up table to get the ASCII number for each character?
I'd guess that when entering characters dynamically into a running C program from standard input, it is the terminal that translates actual characters to numbers, but then if we have in the code, for example:
if (ch == 'c') { /* .. do something */ }
the compiler must have its own way of understanding and mapping the characters to numbers?
Thanks in advance for some help with my confusion.
The C standard talks about the source character set, which is the set of characters it expects to find in the source files, and the execution character set, which is the set of characters used natively by the target platform.
For most modern computers that you're likely to encounter, the source and execution character sets will be the same.
A line like if (ch == 'c') will be stored in the source file as a sequence of values from the source character set. For the 'c' part, the representation is likely 0x27 0x63 0x27, where the 0x27s represent the single quote marks and the 0x63 represents the letter c.
If the execution character set of the platform is the same as the source character set, then there's no need to translate the 0x63 to some other value. It can just use it directly.
If, however, the execution character set of the target is different (e.g., maybe you're cross-compiling for an IBM mainframe that still uses EBCDIC), then, yes, it will need a way to look up the 0x63 it finds in the source file to map it to the actual value for a c used in the target character set.
Outside the scope of what's defined by the standard, there's the distinction between character set and encoding. While a character set tells you what characters can be represented (and what their values are), the encoding tells you how those values are stored in a file.
For "plain ASCII" text, the encoding is typically the identity function: A c has the value 0x63, and it's encoded in the file simply as a byte with the value of 0x63.
Once you get beyond ASCII, though, there can be more complex encodings. For example, if your character set is Unicode, the encoding might be UTF-8, UTF-16, or UTF-32, which represent different ways to store a sequence of Unicode values (code points) in a file.
So if your source file uses a non-trivial encoding, the compiler will have to have an algorithm and/or a lookup table to convert the values it reads from the source file into the source character set before it actually does any parsing.
On most modern systems, the source character set is typically Unicode (or a subset of Unicode). On Unix-derived systems, the source file encoding is typically UTF-8. On Windows, the source encoding might be based on a code page, UTF-8, or UTF-16, depending on the code editor used to create the source file.
On many modern systems, the execution character set is also Unicode, but, on an older or less powerful computer (e.g., an embedded system), it might be restricted to ASCII or the characters within a particular code page.
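If you want to see what your own compiler does, here is a small sketch that prints the numeric values a literal ends up with. The exact numbers depend on the source file's encoding and on the compiler's execution character set, which is the point made above:

#include <stdio.h>

int main(void)
{
    printf("'c' compiles to %d\n", 'c');    /* 99 on ASCII-based platforms, 131 on EBCDIC */

    const unsigned char s[] = "cÑ";         /* these bytes depend on how this file is saved */
    for (size_t i = 0; s[i] != '\0'; i++)
        printf("byte %zu: %u\n", i, s[i]);
    return 0;
}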
Edited to address follow-on question in the comments
Any tool that reads text files (e.g., an editor or a compiler) has three options: (1) assume the encoding, (2) take an educated guess, or (3) require the user to specify it.
Most unix utilities assume UTF-8 because UTF-8 is ubiquitous in that world.
Windows tools usually check for a Unicode byte-order mark (BOM), which can indicate UTF-16 or UTF-8. If there's no BOM, it might apply some heuristics (IsTextUnicode) to guess the encoding, or it might just assume the file is in the user's current code page.
For files that have only characters from ASCII, guessing wrong usually isn't fatal. UTF-8 was designed to be compatible with plain ASCII files. (In fact, every ASCII file is a valid UTF-8 file.) Also many common code pages are supersets of ASCII, so a plain ASCII file will be interpreted correctly. It would be bad to guess UTF-16 or UTF-32 for plain ASCII, but that's unlikely given how the heuristics work.
Regular compilers don't expend much code dealing with all of this. The host environment can handle many of the details. A cross-compiler (one that runs on one platform to make a binary that runs on a different platform) might have to deal with mapping between character sets and encodings.
Sort of. Except you can drop the ASCII bit, in full generality at least.
The mapping between character constants like 'c' (which have type int in C) and their numeric equivalents is a function of the encoding used by the architecture that the compiler is targeting. ASCII is one such encoding, but there are others, and the C standard places only minimal requirements on the encoding, an important one being that '0' through '9' must be consecutive, in one block, positive, and able to fit into a char. Another requirement is that 'A' to 'Z' and 'a' to 'z' must be positive values that can fit into a char.
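That guarantee about '0' through '9' is what makes the classic digit-to-value idiom portable, whether the target uses ASCII, EBCDIC, or something else entirely. A tiny illustration:

#include <stdio.h>

int main(void)
{
    char ch = '7';
    /* Works on any conforming implementation because '0'..'9' are guaranteed
       to be consecutive; no assumption about ASCII is needed. */
    int value = ch - '0';
    printf("%d\n", value);   /* prints 7 */
    return 0;
}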
No, the compiler is not required to have such a thing. Think for a minute about a pre-C11 compiler, reading EBCDIC source and translating for an EBCDIC machine. What use would an ASCII look-up table be in such a compiler?
Also think for another minute about what such ASCII look-up table(s) would even look like in such a compiler!
I have tried to check the importance of, and reason for, using the W WinAPI functions vs. the A ones (W meaning wide char, A meaning ASCII, right?).
I made a simple example; I retrieve the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it's written in Cyrillic, and printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would receive some error, or a mess of symbols, with GetTempPathA, since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So I have a question: if my software were to extract data into the temp folder of the current user, and that user is Chinese (assuming they have Chinese symbols in their user name), would I get a mess or an error using the GetTempPathA version? Should I always use the W-prefixed functions for production software that works with the WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out-of-luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
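A minimal sketch of that pattern, assuming you want to keep the path as UTF-8 inside your program and only use wide strings at the API boundary (error handling is reduced to bare returns here):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Always talk to the API in wide characters... */
    WCHAR wideTemp[MAX_PATH];
    DWORD len = GetTempPathW(MAX_PATH, wideTemp);
    if (len == 0 || len >= MAX_PATH)
        return 1;

    /* ...and keep the application's own copy as UTF-8. */
    char utf8Temp[MAX_PATH * 4];
    int n = WideCharToMultiByte(CP_UTF8, 0, wideTemp, -1,
                                utf8Temp, sizeof utf8Temp, NULL, NULL);
    if (n == 0)
        return 1;

    printf("UTF-8 temp path uses %d bytes\n", n - 1);   /* n includes the terminator */
    return 0;
}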
UTF-8 is similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course you need the wide version. The ANSI version can't technically handle more than 256 distinct characters. Cyrillic is included in the "extended ASCII" set (if that's your locale), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching locale. On a machine with a non-Cyrillic locale, the text will be displayed according to whatever the locale settings define.
I just started to learn C and then want to proceed to learn C++. I am currently using a textbook and just write the examples in order to get a bit more familiar with the programming language and procedure.
Since the example given in the book didn't work, I tried to find other similar code. The problem is that after compiling the code, the program does not show any of the symbols represented by %c. I get symbols for the numbers 33-126, but everything else is either nothing at all or just a white block...
Also, in a previous example I wanted to write °C for temperature, and it couldn't display the ° symbol.
The example I found on the web that does not display the %c symbols is:
#include <stdio.h>
#include <ctype.h>

int main()
{
    int i;
    i = 0;
    do
    {
        printf("%i %c \n", i, i);
        i++;
    }
    while (i <= 255);
}
Is anyone familiar with this? Why can I not get output for %c, or for a symbol like °, as well?
ASCII is a 7-bit character set, which means it consists only of code points in the range [0, 127]. In 8-bit code pages there are another 128 available code points with values from 128 to 255 (i.e., the high bit is set). These are sometimes called extended ASCII (although the extra characters are not part of ASCII at all), and the characters they map to depend on the character set. An 8-bit charset is sometimes also called ANSI, although that's actually a misnomer.
US English Windows uses Windows-1252 code page by default, with the character ° at codepoint 0xB0. Other OSes/languages may use different character sets which have different codepoint for ° or possibly no ° symbol at all.
You have many solutions to this:
If your PC uses an 8-bit charset
Look up the value of ° in the charset your computer is using and print it normally. For example, if you're using CP437, then printf("\xF8") will work because ° is at code point 0xF8. printf("°") also works if you save the source file in the same code page (CP437).
Or just change charset to Windows-1252/ISO 8859-1 and print '°' or '\xB0'. This can be done programmatically (using SetConsoleOutputCP on Windows and similar APIs on other OSes) or manually (by some console settings, or by running chcp 1252 in Windows cmd). The source code file still needs to be saved in the same code page
Print Unicode. This is the recommended way to do it.
Linux/Unix and most other modern OSes use UTF-8, so just output the correct UTF-8 string and you don't need to care about anything. However, because ° is a multibyte sequence in UTF-8, you must print it as a string. That means you need to use %s instead of %c; a single char can't represent ° in UTF-8. Newer Windows 10 also supports UTF-8 as a locale, so you can print the UTF-8 string directly.
On older Windows you need to print the string out as UTF-16. It's a little bit tricky but not impossible.
If you use "\u00B0" and it prints out successfully, then it means your terminal is already using UTF-8. \u is the escape sequence for an arbitrary Unicode code point.
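Here is a small sketch of the Unicode route. The SetConsoleOutputCP call is Windows-only and assumes your console font can render the glyph; on Linux/macOS with a UTF-8 terminal the printf lines alone are enough:

#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

int main(void)
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);     /* tell the console to interpret output as UTF-8 */
#endif
    /* ° is U+00B0, which UTF-8 encodes as the two bytes 0xC2 0xB0,
       so it must be printed as a string, not with %c. */
    printf("%s\n", "\xC2\xB0");
    printf("25\u00B0C\n");           /* \u escape; with a UTF-8 execution charset this
                                        produces the same two bytes */
    return 0;
}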
See also
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Anything outside the range 33-126 isn't a visible ASCII character. 0-32 is stuff like backspace (8), "device control 2" (18), and space (32). 127 is DEL, and anything past that isn't even ASCII; who knows how your terminal will handle that.
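If you just want the loop from the question to skip the invisible codes, here is a sketch using isprint from ctype.h. It only changes which codes get printed; what the glyphs above 127 look like (if anything) still depends on your console's code page and locale:

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    for (int i = 0; i <= 255; i++) {
        if (isprint(i))                      /* skips control codes such as 8, 18 and 127 */
            printf("%3d %c\n", i, i);
        else
            printf("%3d (non-printable)\n", i);
    }
    return 0;
}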
Ok, I have this:
AllocConsole();
SetConsoleOutputCP(CP_UTF8);
HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 10, NULL, NULL);
WriteConsoleW(consoleHandle, L"wΕλληνικά\n", 10, NULL, NULL);
printf("aΕλληνικά\n");
wprintf(L"wΕλληνικά\n");
Now, the issue is that, depending on the encoding the file was saved as, only some of these work. wprintf never works, but I already know why (the broken Microsoft stdout implementation, which only accepts narrow characters). Yet I have issues with the three others. If I save the file as UTF-8 without a signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:
WriteConsoleA(consoleHandle, "aΕλληνικά\n", 18, NULL, NULL);
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
Now, let's go to UTF-16 (Unicode - Codepage 1200). Only WriteConsoleW works. The character count in WriteConsoleA should be 10 to keep the formatting precise.
Saving in UTF-16 Big Endian mode (Unicode - Codepage 1201) does not change anything. Again, WTF? Shouldn't the byte order inside the strings be inverted when stored to file?
The conclusion is that the way strings are compiled into binary form depends on the encoding used. Therefore, what is the portable and compiler-independent way to store strings? Is there a preprocessor that would convert one string representation into another before compilation, so I could store the file in UTF-8 and only preprocess the strings that I need in UTF-16 by wrapping them in some macro?
I think you've got at least a few assumptions here which are either wrong or not 100% correct as far as I know:
Now, the issue is that, depending on the encoding the file was saved as, only some of these work.
Of course, because the encoding determines how to interpret the string literals.
wprintf never works, but I already know why (broken Microsoft stdout implementation, which only accepts narrow characters).
I've never heard of that one, but I'm rather sure this depends on the locale set for your program. I've got a few work projects where a locale is set and the output is just fine using German umlauts etc.
If I save the file as UTF-8 without a signature (BOM) and use the MS Visual C++ compiler, only the last printf works. If I want the ANSI version to work, I need to increase the character(?) count to 18:
That's because the ANSI version wants an ANSI string, while you're passing a UTF-8 encoded string (based on the file's encoding). The output still works, because the console handles the UTF-8 conversion for you - you're essentially printing raw UTF-8 here.
WriteConsoleW does not work, I assume, because the string is saved as a UTF-8 byte sequence even though I explicitly request it to be stored as wide-char (UTF-16) with the L prefix, and the implementation most probably expects a UTF-16 encoded string, not UTF-8.
I don't think so (although I'm not sure why it isn't working either). Have you tried setting some easy-to-find string and looking for it in the resulting binary? I'm rather sure it's indeed encoded using UTF-16. I assume that, due to the missing BOM, the compiler interprets the whole thing as a narrow string and therefore converts the UTF-8 bytes incorrectly.
If I save it in UTF-8 with BOM (as it should be), then WriteConsoleW starts to work somehow (???) and everything else stops (I get ? instead of a character). I need to decrease the character count in WriteConsoleA back to 10 to keep the formatting the same (otherwise I get 8 additional rectangles). Basically, WTF?
This is exactly what I described above. Now the wide string is encoded properly, because the compiler knows the file is in UTF-8, not ANSI (or some code page). The narrow string is properly converted to the locale being used as well.
Overall, there's no encoding-independent way to do it, unless you escape everything using the proper code page and/or UTF codes in advance. I'd just stick to UTF-8 with BOM, because I think all current compilers will be able to properly read and interpret the file (besides Microsoft's Resource Compiler; although I haven't tried feeding the 2012 version with UTF-8).
Edit:
To use an analogy:
You're essentially saving a raw image to a file and expecting it to work properly, no matter whether other programs try to read it as a grayscale, palettized, or full-color image. This won't work (even though the differences here are smaller).
The answer is here.
Quoting:
It is impossible for the compiler to intermix UTF-8 and UTF-16
strings into the compiled output! So you have to decide for one source
code file:
either use UTF-8 with BOM and generate UTF-16 strings only (i.e. always use the L prefix),
or UTF-8 without BOM and generate UTF-8 strings only (i.e. never use L prefix),
7-bit ASCII characters are not involved and can be used with or without L prefix
The only portable and compiler-independent way is to use the ASCII character set and escape sequences, because there is no guarantee that every compiler will accept a UTF-8 encoded file, and compilers' treatment of those multibyte sequences may vary.
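For example, the Greek string from the question can be spelled out with escapes so that the source file itself stays pure 7-bit ASCII. The values below are the standard UTF-16 code points and UTF-8 bytes for "Ελληνικά"; note how the lengths come out as the 10 and 18 seen in the question:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    /* Wide string built from universal character names (\u); on Windows
       each \uXXXX becomes one UTF-16 code unit. */
    const wchar_t *wGreek = L"w\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC\n";

    /* Narrow string with the UTF-8 bytes written out explicitly. */
    const char *u8Greek = "a\xCE\x95\xCE\xBB\xCE\xBB\xCE\xB7\xCE\xBD\xCE\xB9\xCE\xBA\xCE\xAC\n";

    printf("wide: %zu code units, narrow: %zu bytes\n",
           wcslen(wGreek), strlen(u8Greek));
    return 0;
}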
Is it possible to know if a file has Unicode (16 bits per character) or 8-bit ASCII content?
You may be able to read a byte-order-mark, if the file has this present.
UTF-16 code units are all at least 16 bits, with some characters taking two code units (a lead unit in the surrogate range 0xD800 to 0xDBFF). So simply scanning each byte to see if it is less than 128 won't work. For example, the two bytes 0x20 0x20 encode two spaces in ASCII and UTF-8, but encode the single character 0x2020 (dagger) in UTF-16. If the text is known to be English with the occasional non-ASCII character, then nearly every other byte will be zero. But without some a priori knowledge about the text and/or its encoding, there is no reliable way to distinguish a general ASCII string from a general UTF-16 string.
Ditto to what Brian Agnew said about reading the byte order mark, a special two bytes that might appear at the beginning of the file.
You can also tell whether it is ASCII by scanning every byte in the file and seeing if they are all less than 128. If they are all less than 128, then it's just an ASCII file. If some of them are 128 or greater, some other encoding is in use.
First off, ASCII is 7-bit, so if any byte has its high bit set you know the file isn't ASCII.
The various "common" character sets such as ISO-8859-x, Windows-1252, etc., are 8-bit, so if every other byte is 0, you know that you're dealing with UTF-16 text that only uses the ISO-8859 characters.
You'll run into problems when you're trying to distinguish between a legacy 8-bit encoding and a Unicode encoding such as UTF-8. In this case, almost every byte will have a value, so you can't make an easy decision. You can, as Pascal says, do some sort of statistical analysis of the content: Arabic and Ancient Greek probably won't be in the same file. However, this is probably more work than it's worth.
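A rough sketch of these heuristics combined, assuming the content is already in a memory buffer (the classification labels are mine, and the checks are intentionally simplistic):

#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Very rough classification: returns a static string naming the guess. */
const char *guess_encoding(const unsigned char *buf, size_t len)
{
    /* 1. Byte-order marks are the only near-certain signal. */
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) return "UTF-8 (BOM)";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)    return "UTF-16 little-endian (BOM)";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)    return "UTF-16 big-endian (BOM)";

    /* 2. Count high-bit and zero bytes in one pass. */
    size_t high = 0, zero = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 128) high++;
        if (buf[i] == 0)   zero++;
    }

    if (high == 0 && zero == 0) return "ASCII";            /* also valid UTF-8 */
    if (zero > 0)               return "probably UTF-16";  /* NULs are rare in 8-bit text */
    return "some 8-bit encoding (UTF-8, Windows-125x, ISO-8859-x, ...)";
}

int main(void)
{
    const unsigned char sample[] = "var x = 1;";
    puts(guess_encoding(sample, sizeof sample - 1));   /* prints: ASCII */
    return 0;
}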
Edit in response to OP's comment:
I think that it will be sufficient to check for the presence of 0-value bytes (ASCII NUL) within your content and make the choice based on that. The reason is that JavaScript keywords are ASCII, and ASCII is a subset of Unicode. Therefore any UTF-16 representation of those keywords will consist of one byte containing the ASCII character (the low byte) and another containing 0 (the high byte).
My one caveat is that you carefully read the documentation to ensure that their use of the word "Unicode" is correct (I looked at this page to understand the function, did not look any further).
If the file for which you have to solve this problem is long enough each time, and you have some idea what it's supposed to be (say, English text in Unicode or English text in ASCII), you can do a simple frequency analysis on the bytes and see if the distribution looks like that of ASCII or of Unicode.
Unicode is a character set, not an encoding. You probably meant UTF-16. There are lots of libraries around (python-chardet comes to mind instantly) to autodetect the encoding of text, though they all use heuristics.
To programmatically discern the type of a file, including but not limited to its encoding, the best bet is to use libmagic. BSD-licensed, it is part of just about every Unix system you are likely to encounter, and for the lesser ones you can bundle it with your application.
Detecting the mime-type from C, for example, is as simple as:
magic_t Magic = magic_open(MAGIC_MIME | MAGIC_ERROR);
magic_load(Magic, NULL);    /* load the default magic database */
const char *mimetype = magic_buffer(Magic, buf, bufsize);
Other languages have their own modules wrapping this library.
Back to your question, here is what I get from file(1) (the command-line interface to libmagic(3)):
% file /tmp/*rdp
/tmp/meow.rdp: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
For your specific use case, it's very easy to tell. Just scan the file: if you find any NUL byte ("\0"), it must be UTF-16. JavaScript has to contain ASCII characters, and those are represented with a 0 byte (the high byte) in UTF-16.
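A minimal file-based sketch of that check (the file name is just a placeholder; a production version should also look for a UTF-16 BOM, as the other answers note):

#include <stdio.h>

/* Returns 1 if the file contains any NUL byte (suggesting UTF-16 for a script
   that must contain ASCII keywords), 0 if it does not, -1 on I/O error. */
int looks_like_utf16(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    int found = 0, c;
    while ((c = fgetc(f)) != EOF) {
        if (c == 0) { found = 1; break; }
    }
    fclose(f);
    return found;
}

int main(void)
{
    int r = looks_like_utf16("script.js");   /* placeholder file name */
    if (r == 1)      printf("probably UTF-16\n");
    else if (r == 0) printf("8-bit text (ASCII, UTF-8, ...)\n");
    else             printf("cannot open file\n");
    return 0;
}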