This question already has answers here:
Converting Letters to Numbers in C
(10 answers)
Closed 6 years ago.
Alright so pretty simple, I want to convert a letter to a number so that a = 0, b = 1, etc. Now I know I can do
number = letter + '0';
so when I input the letter 'a' it gives me the number 145. My question is, if I am to run this on a different computer or OS, would it still give me the same number 145 for when I input the letter 'a'?
It depends on what character encoding you are using. If you're using the same encoding and compiler on both computers, yes, it will be the same. But if you're using a different encoding, like EBCDIC on one computer and ASCII on the other, you cannot guarantee the values will be the same.
Also, you can use atoi.
If you do not want to use atoi, see: Converting Letters to Numbers in C
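For the a = 0, b = 1 mapping you describe, the usual idiom is to subtract 'a' rather than add '0'. A minimal sketch, assuming an ASCII-compatible execution character set where 'a'..'z' are contiguous (true for ASCII and UTF-8, not guaranteed for EBCDIC):

#include <stdio.h>

int main(void)
{
    char letter = 'a';
    /* On ASCII-compatible systems 'a'..'z' are contiguous,
       so subtracting 'a' yields 0 for 'a', 1 for 'b', and so on. */
    int number = letter - 'a';
    printf("%c -> %d\n", letter, number);   /* prints: a -> 0 */
    return 0;
}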
It depends on what character encoding you are using.
It is also important to note that if you use ASCII the value will fit in a byte.
If you are using UTF-8, for example, characters outside the ASCII range won't fit in a single byte and will require at least two bytes.
Now, assuming you make sure to use one specific character encoding, the value will be the same no matter the system.
Yes, the number used to represent a is defined by the American Standard Code for Information Interchange (ASCII). This is the encoding most C compilers use by default, so on most other OSs you will get the same result.
Related
I've been diving into C/low-level programming/system design recently. As a seasoned Java developer I still remember my attempts to pass the SUN Java Certification, with its questions about whether the char type in Java can be cast to Integer and how that can be done. That is what I know and remember: numbers up to 255 can be treated either as numbers or as characters, depending on the cast.
Getting to know C, I want to understand more, but I find it hard to find a proper answer (I tried googling, but I usually get a gazillion results on how to merely convert char to int in code): how EXACTLY does it work that the C compiler/system calls transform a number into a character and vice versa?
AFAIK only numbers are stored in memory. So let's assume a memory cell stores the value 65 (which is the letter 'A'). There is a value stored, and at some point C code wants to read it and store it into a char variable. So far so good. Then we call printf with the %c format for that char parameter.
And here is where the magic happens: HOW EXACTLY does printf know that the character with value 65 is the letter 'A' (and should display it as a letter)? It is a basic character from the raw ASCII range (not some fancy emoji-style UTF character). Does it call external standard libraries or system calls to consult an encoding table? I would love a nitty-gritty, low-level explanation, or at least a link to a trusted source.
The C language is largely agnostic about the actual encoding of characters. It has a source character set which defines how the compiler treats characters in the source code. So, for instance on an old IBM system the source character set might be EBCDIC where 65 does not represent 'A'.
C also has an execution character set which defines the meaning of characters in the running program. This is the one that seems more pertinent to your question. But it doesn't really affect the behavior of I/O functions like printf. Instead it affects the results of ctype.h functions like isalpha and toupper. printf just treats it as a char sized value which it receives as an int due to variadic functions using default argument promotions (any type smaller than int is promoted to int, and float is promoted to double). printf then shuffles off the same value to the stdout file and then it's somebody else's problem.
If the source character set and execution character set are different, then the compiler will perform the appropriate conversion so the source token 'A' will be manipulated in the running program as the corresponding A from the execution character set. The choice of actual encoding for the two character sets, ie. whether it's ASCII or EBCDIC or something else is implementation defined.
With a console application it is the console or terminal which receives the character value that has to look it up in a font's glyph table to display the correct image of the character.
Character constants are of type int. Except for the fact that it is implementation defined whether char is signed or unsigned, a char can mostly be treated as a narrow integer. The only conversion needed between the two is narrowing or widening (and possibly sign extension).
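A small sketch of those last two points, assuming a typical hosted implementation: a character constant such as 'A' has type int, and a char argument to a variadic function like printf is promoted to int before printf ever sees it.

#include <stdio.h>

int main(void)
{
    char c = 'A';
    /* A character constant such as 'A' has type int in C,
       so sizeof 'A' equals sizeof(int), while sizeof c is 1. */
    printf("sizeof 'A' = %zu, sizeof c = %zu, sizeof(int) = %zu\n",
           sizeof 'A', sizeof c, sizeof(int));
    /* When c is passed to printf, default argument promotion widens it to int;
       %c then writes that value back out as a single character. */
    printf("%c\n", c);
    return 0;
}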
"HOW EXACTLY printf knows that character with value 65 is letter 'A' (and should display it as a letter)."
It usually doesn't, and it does not even need to. Even the compiler does not see characters ', A and ' in the C language fragment
char a = 'A';
printf("%c", a);
If the source and execution character sets are both ASCII or ASCII-compatible, as is usually the case nowadays, the compiler will have among the stream of bytes the triplet 39, 65, 39 - or rather 00100111 01000001 00100111. And its parser has been programmed with a rule that something between two 00100111s is a character literal, and since 01000001 is not a magic value it is translated as is to the final program.
The C program, at runtime, then handles 01000001 all the time (though from time to time it might be 01000001 zero-extended to an int, e.g. 00000000 00000000 00000000 01000001 on 32-bit systems; adding leading zeroes does not change its numerical value). On some systems, printf - or rather the underlying internal file routines - might translate the character value 01000001 to something else. But on most systems, 01000001 will be passed to the operating system as is. The operating system - or possibly a GUI program receiving the output from the operating system - will then want to display that character, so the display font is consulted for the glyph that corresponds to 01000001, and usually the glyph for the letter 01000001 looks something like
A
And that will be displayed to the user.
At no point does the system really operate with glyphs or characters but just binary numbers. The system in itself is a Chinese room.
The real magic of printf is not how it handles characters, but how it handles numbers, as these are converted to more characters. While %c passes values as-is, %d will convert a simple integer value such as 0b101111000110000101001110 into the stream of bytes 0b00110001 0b00110010 0b00110011 0b00110100 0b00110101 0b00110110 0b00110111 0b00111000 so that the display routine will correctly display it as
12345678
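To make that conversion concrete, here is a toy sketch of the digit-by-digit work a %d conversion does internally; it is not the actual printf source, just the idea of producing '0' + digit bytes for the display routine.

#include <stdio.h>

/* Toy version of what a %d conversion does for a non-negative value:
   peel off decimal digits and turn each one into a character with '0' + digit. */
static void print_decimal(unsigned value)
{
    char buf[32];
    int i = 0;
    do {
        buf[i++] = (char)('0' + value % 10);  /* digit -> digit character */
        value /= 10;
    } while (value > 0);
    while (i > 0)
        putchar(buf[--i]);                    /* digits were produced in reverse */
}

int main(void)
{
    print_decimal(12345678);   /* writes the bytes '1','2',...,'8' */
    putchar('\n');
    return 0;
}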
char in C is just an integer CHAR_BIT bits long. Usually it is 8 bits long.
HOW EXACTLY printf knows that character with value 65 is letter 'A'
The implementation knows what character encoding it uses, and the printf function's code takes the appropriate action to output the letter 'A'.
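If you want to see what the implementation reports, a minimal sketch using CHAR_BIT from <limits.h> (assuming a hosted implementation):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* char is just a small integer type; CHAR_BIT tells you how many bits it has. */
    printf("char is %d bits wide on this implementation\n", CHAR_BIT);  /* usually 8 */
    return 0;
}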
I'm expecting a single-digit integer input, and have error handling in place already if this is not the case. Are there any potential unforeseen consequences of simply subtracting '0' from the input character to "convert" it into an integer?
I'm not looking for opinions on readability or what's more commonly used (although they wouldn't hurt as an extension to the answer), but simply whether or not it's a reliable form of conversion. If I ask the user to input an integer between 0 and 9, is there any scenario in which there can be input that input = input-'0' should handle, but doesn't?
This is safe and guaranteed by the C language. In the current version, C11, the relevant text is 5.2.1 Character sets, ¶3:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
As for whether it's "bad practice", that's a matter of opinion, but I would say no. It's both idiomatic (commonly used and understood by C programmers) and lacks any alternative that's not confusing and inefficient. For example nobody reading C would want to see this written as a switch statement with 10 cases or by setting up a dummy one-character string to pass to atoi.
The order of characters is encoding/system-dependent, so one must not rely on a particular order in general. For the sequence of digits 0..9, however, every system guarantees that it starts with 0 and continues to 9 without any intermediate characters. So input = input - '0' is perfect as long as you guarantee that input contains a digit (e.g. by using isdigit).
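A minimal sketch of that combination, assuming single-character input via getchar: validate with isdigit first, then subtract '0'.

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int input = getchar();                 /* read one character */
    if (input != EOF && isdigit(input)) {  /* make sure it really is '0'..'9' */
        int value = input - '0';           /* the standard guarantees this gives 0..9 */
        printf("digit value: %d\n", value);
    } else {
        fprintf(stderr, "not a decimal digit\n");
    }
    return 0;
}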
This question already has answers here:
Where did the name `atoi` come from?
(2 answers)
Closed 5 years ago.
Does anyone know the source of the name of the function atol for converting a string to a long?
I thought it might stand for "array to long", but that doesn't sound right to me.
ASCII To Long is what atol(3) means (in the early days of Unix, only ASCII was used, and IIRC the name was mentioned in the K&R book).
Today we usually use UTF-8 everywhere, but atol still works (since UTF-8 uses the same encoding as ASCII for the digits).
On C implementations using another encoding (e.g. EBCDIC) atol should still do what is expected (so atol("345") would give 345), since the C standard requires that the encoding of digit characters is consecutive. Its implementation might be more complex (or encoding-specific).
So today the atol name no longer refers to ASCII. The C11 standard (n1570) doesn't mention ASCII as mandatory, IIRC. You might rewrite history by reading atol as "anything to long", even though historically it was "ASCII to long".
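A toy sketch of why the consecutive-digit guarantee is all such a function really needs; unlike the real atol it ignores leading whitespace, signs and overflow:

#include <ctype.h>
#include <stdio.h>

/* Toy atol-style conversion: relies only on the guarantee that '0'..'9'
   are consecutive, so it works on ASCII, UTF-8 and EBCDIC systems alike. */
static long to_long(const char *s)
{
    long result = 0;
    while (isdigit((unsigned char)*s))
        result = result * 10 + (*s++ - '0');
    return result;
}

int main(void)
{
    printf("%ld\n", to_long("345"));   /* prints 345 */
    return 0;
}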
It's ASCII to long; the same convention is used for atoi etc.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Char to int conversion in C.
I remember learning in a course a long time ago that converting from an ASCII char to an int by subtracting '0' is bad.
For example:
int converted;
char ascii = '8';
converted = ascii - '0';
Why is this considered a bad practice? Is it because some systems don't use ASCII? The question has been bugging me for a long time.
While you probably shouldn't use this as part of a hand rolled strtol (that's what the standard library is for) there is nothing wrong with this technique for converting a single digit to its value. It's simple and clear, even idiomatic. You should, though, add range checking if you are not absolutely certain that the given char is in range.
It's a C language guarantee that this works.
5.2.1/3 says:
In both the source and execution basic character sets, the value of each character after 0 in the above list [includes the sequence: 0,1,2,3,4,5,6,7,8,9] shall be one greater than the value of the previous.
Character sets may exist where this isn't true but they can't be used as either source or execution character sets in any C implementation.
Edit: Apparently the C standard guarantees consecutive 0-9 digits.
ASCII is not guaranteed by the C standard, in effect making it non-portable. You should use a standard library function intended for conversion, such as atoi.
However, if you wish to make assumptions about where you are running (for example, an embedded system where space is at a premium), then by all means use the subtraction method. Even on systems not using the US-ASCII code page (UTF-8, other code pages) this conversion will work. It will even work on EBCDIC (amazingly).
This is a common trick taught in C classes primarily to illustrate the notion that a char is a number and that its value is different from the corresponding int.
Unfortunately, this educational toy somehow became part of the typical arsenal of most C developers, partially because C doesn't provide a convenient call for this (it is often platform specific, I'm not even sure what it is).
Generally, this code is not portable to non-ASCII platforms, or across future transitions to other encodings. It's also not really readable. At a minimum, wrap this trick in a function.
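If you do wrap it, a minimal sketch might look like this (the helper name digit_value is just an illustration, not a standard function):

#include <stdio.h>

/* Hypothetical helper: returns 0..9 for a decimal digit character,
   or -1 if the character is not a digit. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9')   /* '0'..'9' are consecutive in every conforming character set */
        return c - '0';
    return -1;                  /* let the caller decide how to handle bad input */
}

int main(void)
{
    printf("%d %d\n", digit_value('8'), digit_value('x'));   /* prints: 8 -1 */
    return 0;
}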
In ASCII, I wonder how 65 is translated to the character 'A'?
As far as my knowledge goes, 65 can be represented in binary but 'A' cannot. So how does this conversion happen?
Everything in a computer is binary. So a string in C is a sequence of binary values. Obviously that is not much use to humans, so various standards developed, where people decided what numerical values would represent certain letters. In ASCII the value 65 represents the letter A. So the value stored is 65, but everyone knows (because they have read the ASCII spec) that value corresponds to the letter A.
For example, if I am writing the code to display text on the screen, and I receive the value 65, I know to set certain pixels and delete other pixels, so that pixels are arranged like:
  #
 # #
#####
#   #
#   #
At no point does my code "really know" that is an "A". It just knows that 65 is displayed as that pattern. Because, as you say, you cannot store letters directly, only binary numbers.
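To illustrate that idea with code (a toy, not how any real font engine works): the display code can simply index a table of glyph patterns by the character's numeric value.

#include <stdio.h>

/* Toy 5-row "font" with a single glyph defined: index 65 ('A' in ASCII).
   Real font tables are far larger and richer, but the lookup idea is the same. */
static const char *glyphs[128][5] = {
    [65] = { "  #  ", " # # ", "#####", "#   #", "#   #" },
};

static void draw(unsigned char code)
{
    if (code >= 128 || glyphs[code][0] == NULL)
        return;                          /* no glyph defined for this value */
    for (int row = 0; row < 5; row++)
        puts(glyphs[code][row]);         /* the code never "knows" it is an A */
}

int main(void)
{
    draw(65);   /* prints the pixel pattern associated with value 65 */
    return 0;
}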
It is just a 'definition'. ASCII defines the relationships between integer values and characters. For implementation, there is a table (you can't see it) that does this translation.
EDIT:
Computers deal with just 0s and 1s. A stream of characters is just a bunch of 0/1 streams: 0110010101... There is a contract between human and computer: 8 bits represent a character (okay, there are Unicode, UTF-8, etc.). And 'A' is 65, and so on.
In C/C++ and other languages, strings are just handled like integer arrays. Only when you need to display a string are the numbers 'translated' into characters. This translation is done by either hardware or software:
If you write a function that draws characters, you are responsible for drawing 'A' when the input is 65.
In the past, say under DOS, the computer drew 'A' for the number 65. That relationship was usually stored in memory. (Back then, with no graphics, only text, this table could be tweaked to extend the character set. I remember Norton DOS utilities such as NDD/NCD changing this table to draw special characters that were not in the regular ASCII code.)
You may see this sort of contract or definition everywhere. For example, assembly code. Your program will eventually be translated into machine code: that is also just a bunch of 0s and 1s. But it is extremely hard to understand when only 0s and 1s are shown. So there is a rule: say 101010 means "add" and 1100 means "mov". That's why we can program like "add eax, 1", and it will ultimately be decoded into 0s and 1s.
'A' IS 65. It's just that your display device knows that it should display the value 65 as an A when it renders that value as a character.
The ASCII table is just an agreed upon map of values and characters.
When the computer is instructed to write a character represented by a number to the screen, it just finds the number's corresponding image. The image doesn't make any sense to the computer; it could be an image that looks like an 'A' or a snowman to the user.
So how could this conversion happen?
This conversion is merely called character encoding. The computer only understands bytes, and humans (on average =) ) only understand characters. The computer has, roughly speaking, a mapping between all bytes and the characters which belong to those bytes, so that it can present the data in a human-friendly manner. It's all software-based (thus not hardware-based). The operating system is usually the one that takes care of this.
ASCII is one of the oldest character encodings. Nowadays we should all be on UTF-8 to avoid mojibake.
Everything in a computer is stored as a number. It's how software interprets those numbers that's important.
ASCII is a standard that maps the number 65 to the letter 'A'. They could have chosen 66 or 14 to represent 'A', but they didn't. It's almost arbitrary.
So if you have the number 65 sitting in computer memory somewhere, a piece of code that treats that piece of memory as ASCII will map the 65 to 'A'. Another piece of code that treats that memory as an entirely different format may translate it to something else entirely.
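A small sketch of that point: the same byte in memory, printed once as a number and once as a character; only the format we choose gives it the 'A' meaning.

#include <stdio.h>

int main(void)
{
    unsigned char byte = 65;              /* just a number sitting in memory */
    printf("as an integer: %d\n", byte);  /* 65 */
    printf("as ASCII text: %c\n", byte);  /* A - only because we chose to treat it that way */
    return 0;
}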
The code for converting an entered ASCII value to the corresponding character is:
#include <stdio.h>
int main(void) {
    int a;
    printf("enter the ASCII value : ");
    scanf("%d", &a);                         /* read the code as a number */
    printf("%d is the ASCII of %c\n", a, a); /* print the same value as a character */
    return 0;
}
It's based on a lookup table invented back in the 1960s.