Why are hexadecimal numbers prefixed with 0x? - c

Why are hexadecimal numbers prefixed as 0x?
I understand the usage of the prefix but I don't understand the significance of why 0x was chosen.

Short story: The 0 tells the parser it's dealing with a constant (and not an identifier/reserved word). Something is still needed to specify the number base: the x is an arbitrary choice.
Long story: In the 1960s, the prevalent programming number systems were decimal and octal — mainframes had 12, 24 or 36 bits per word, which is nicely divisible by 3 = log2(8).
The BCPL language used the syntax 8 1234 for octal numbers. When Ken Thompson created B from BCPL, he used the 0 prefix instead. This is great because
an integer constant now always consists of a single token,
the parser can still tell right away it's got a constant,
the parser can immediately tell the base (0 is the same in both bases),
it's mathematically sane (00005 == 05), and
no precious special characters are needed (as in #123).
When C was created from B, the need for hexadecimal numbers arose (the PDP-11 had 16-bit words) and all of the points above were still valid. Since octals were still needed for other machines, 0x was arbitrarily chosen (00 was probably ruled out as awkward).
C# is a descendant of C, so it inherits the syntax.
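As a small illustration (my own example, not part of the original answer), here is the same value written in the three classic C bases; the parser identifies the base purely from the prefix:

#include <stdio.h>

int main(void)
{
    int dec = 20;      /* decimal: no prefix                      */
    int oct = 024;     /* octal: leading 0, inherited from B      */
    int hex = 0x14;    /* hexadecimal: 0x prefix, introduced in C */
    printf("%d %d %d\n", dec, oct, hex);   /* prints: 20 20 20 */
    return 0;
}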

Note: I don't know the correct answer, but the below is just my personal speculation!
As has been mentioned, a leading 0 before a number means it's octal:
04524 // octal, leading 0
Imagine needing to come up with a system to denote hexadecimal numbers, and note we're working in a C-style environment. How about ending with h like assembly? Unfortunately you can't - it would allow you to make tokens which are valid identifiers (e.g. you could name a variable the same thing), which would make for some nasty ambiguities.
8000h // hex
FF00h // oops - valid identifier! Hex or a variable or type named FF00h?
You can't lead with a character for the same reason:
xFF00 // also valid identifier
Using a hash was probably thrown out because it conflicts with the preprocessor:
#define ...
#FF00 // invalid preprocessor token?
In the end, for whatever reason, they decided to put an x after a leading 0 to denote hexadecimal. It is unambiguous since it still starts with a digit, so it can't be a valid identifier, and it is probably based on the octal convention of a leading 0.
0xFF00 // definitely not an identifier!

It's a prefix to indicate the number is in hexadecimal rather than in some other base. The programming language uses it to tell the compiler.
Example:
0x6400 translates to 6*16^3 + 4*16^2 + 0*16^1 +0*16^0 = 25600.
When the compiler reads 0x6400, it understands the number is hexadecimal because of the 0x prefix. On paper we would usually write (6400)₁₆ or (6400)₈ to show the base, but a plain-text program needs a prefix instead.
For binary (a common compiler extension, standardized only in C23) it would be:
0b00000001
Good day!

The preceding 0 is used to indicate a number in base 2, 8, or 16.
In my opinion, 0x was chosen to indicate hex because 'x' sounds like hex.
Just my opinion, but I think it makes sense.
Good Day!

I don't know the historical reasons behind 0x as a prefix to denote hexadecimal numbers - as it certainly could have taken many forms. This particular prefix style is from the early days of computer science.
As we are used to decimal numbers, there is usually no need to indicate the base/radix. However, for programming purposes we often need to distinguish between the most commonly used number bases: binary (base-2), octal (base-8), decimal (base-10) and hexadecimal (base-16).
At this point in time it is a convention used to denote the base of a number. I've written the number 29 in all of the above bases with their prefixes:
0b11101: Binary
0o35: Octal, denoted by an o (C itself spells octal with just a bare leading 0, as in 035)
0d29: Decimal; this is unusual because we normally assume numbers without a prefix are decimal
0x1D: Hexadecimal
Basically, a letter we most commonly associate with a base (e.g. b for binary) is combined with a leading 0 to easily distinguish a number's base.
This is especially helpful because smaller numbers can confusingly appear the same in all the bases: 0b1, 0o1, 0d1, 0x1.
If you were using a rich text editor though, you could alternatively use subscript to denote bases: 1₂, 1₈, 1₁₀, 1₁₆
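As a rough C-focused illustration (my own snippet; standard C only accepts the bare leading 0 for octal and 0x for hex, while 0b is a compiler extension until C23 and 0o/0d are not C syntax at all), here is 29 written in the bases a typical C compiler accepts:

#include <stdio.h>

int main(void)
{
    printf("%d\n", 29);     /* decimal          */
    printf("%d\n", 035);    /* octal: 3*8 + 5   */
    printf("%d\n", 0x1D);   /* hex: 1*16 + 13   */
    return 0;
}

All three lines print 29.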

Related

Is subtracting a char by '0' to convert to int bad practice?

I'm expecting a single digit integer input, and have error handling in place already if this is not the case. Are there any potential unforeseen consequences of simply subtracting '0' from the input character to "convert" it into an integer?
I'm not looking for opinions on readability or what's more commonly used (although they wouldn't hurt as an extension to the answer), but simply whether or not it's a reliable form of conversion. If I ask the user to input an integer between 0 and 9, is there any scenario in which there can be input that input = input-'0' should handle, but doesn't?
This is safe and guaranteed by the C language. In the current version, C11, the relevant text is 5.2.1 Character sets, ¶3:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
As for whether it's "bad practice", that's a matter of opinion, but I would say no. It's both idiomatic (commonly used and understood by C programmers) and lacks any alternative that's not confusing and inefficient. For example nobody reading C would want to see this written as a switch statement with 10 cases or by setting up a dummy one-character string to pass to atoi.
The order of characters is encoding/system-dependent, so one must not rely on a particular order in general. For the sequence of digits 0..9, however, every system guarantees that it starts at '0' and continues to '9' without any intermediate characters. So input = input - '0' is perfect as long as you guarantee that input contains a digit (e.g. by using isdigit).
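A minimal sketch of the usual pattern (my own example, not from the answers above): validate with isdigit first, then subtract '0'.

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int c = getchar();
    if (c != EOF && isdigit((unsigned char)c)) {
        int digit = c - '0';   /* safe: '0'..'9' are guaranteed to be contiguous */
        printf("digit = %d\n", digit);
    } else {
        fputs("not a decimal digit\n", stderr);
    }
    return 0;
}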

Why does C99 have such an odd restriction for universal character names?

6.4.3 Universal character names
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Besides the fact that it is no longer "universal" with restrictions like this, I can't think of good reasons for such a restriction. Anyone knows the backstory?
D800 through DFFF inclusive are not valid code points; they are high and low surrogates, which can only be found in pairs in UTF-16 encoding in order to represent code points outside of the base plane.
The other restriction avoids having a universal character name collide with a character which could be represented in the C character set, for the benefit of compilers which don't bother resolving universal character names into their Unicode equivalents. So the compiler is under no obligation to recognize a + written as \u002B or to know that a and \u0061 represent the same name. ($, @ and ` are not valid in a C program outside of comments and character strings, so they do not require any special attention from the lexer.)
The range of code points less than A0 also includes control characters and whitespace. (C does not consider \u00A0 to be whitespace.)
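An illustrative sketch of what the rule allows and forbids (my own example; whether the first line is accepted also depends on the compiler supporting C99 universal character names in identifiers):

int \u03B1 = 1;         /* OK: U+03B1 (Greek alpha) is above 00A0 and allowed in identifiers */
/* int \u0041 = 2; */   /* constraint violation: U+0041 is plain 'A', below 00A0 */
/* int c = '\uD800'; */ /* constraint violation: D800 is a UTF-16 surrogate, not a code point */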

What does the datatype specification '9(7)V9T' mean?

In some functional specs I'm reading they are talking about a numeric format with a 9(7)V9T presentation.
- How do I interpret this kind of format notation?
- How is this type physically stored in a flat file (e.g. numeric? signs? separators?)
Thank you for your wise answers!
A COBOL PICTURE string, such as 9(7)V9T, specifies the general characteristics and editing requirements of an elementary data item. A 9 represents a decimal digit, and the (7) is a repetition factor for the preceding character, in this case a 9. The V is an implied decimal point. This is all standard COBOL. So far we have an 8-digit decimal number with an implied decimal point between the 7th and 8th digits.
The T is a bit of a curve ball. I have never actually come across it before. However, I Googled up this reference. It states that a T in a PICTURE string "... indicates that a display numeric field should only insert the sign into the upper half of the last byte if the value is negative". Unfortunately, the author doesn't provide a reference, so I can't give you the source of this convention.
A COBOL picture of PIC S9(7)V9 USAGE DISPLAY on an IBM platform conforms to the 9(7)V9T description you have. This data item takes 8 bytes to represent. Each of the 8 digits is represented in the low 4 bits of its byte, with the sign recorded in the upper 4 bits of the low-order byte. This just happens to be the way IBM chose to implement zoned decimal. Using a 9(7)V9T description makes that representation explicit.
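To make that layout concrete, here is a small C sketch (my own illustration, assuming the zoned-decimal convention described above: the digit in the low nibble of each byte, and the sign in the zone nibble of the last byte, with 0xD meaning negative):

#include <stdio.h>

/* Decode an 8-byte zoned-decimal field declared as PIC S9(7)V9:
   seven integer digits plus one implied decimal digit. */
long decode_zoned(const unsigned char field[8], int *negative)
{
    long value = 0;
    for (int i = 0; i < 8; i++)
        value = value * 10 + (field[i] & 0x0F);   /* digit = low nibble       */
    *negative = ((field[7] & 0xF0) == 0xD0);      /* sign = last zone nibble  */
    return value;                                  /* scaled by 10 (implied V) */
}

int main(void)
{
    /* -1234567.8: zone F on every digit, sign D on the last byte */
    unsigned char field[8] = {0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xD8};
    int neg;
    long scaled = decode_zoned(field, &neg);
    printf("%s%ld.%ld\n", neg ? "-" : "", scaled / 10, scaled % 10);   /* prints -1234567.8 */
    return 0;
}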
An alternative to the other answers is that the T is a character to be displayed or printed after the numeric value to represent a specific state, similar to the use of CR for a credit value or a trailing '-' to indicate a negative value.

Value of C define changes unexpectedly

I have a lot of #define's in my code. Now a weird problem has crept up.
I have this:
#define _ImmSign 010100
(I'm trying to simulate a binary number)
Obviously, I expect the number to become 10100. But when I use the number it has changed into 4160.
What is happening here? And how do I stop it?
ADDITIONAL
Okay, so this is due to the language interpreting this as an octal. Is there some smart way however to force the language to interpret the numbers as integers? If a leading 0 defines octal, and 0x defines hexadecimal now that I think of it...
Integer literals starting with a 0 are interpreted as octal, not decimal, in the same way that integer literals starting with 0x are interpreted as hexadecimal.
Remove the leading zero and you should be good to go.
Note also that identifiers beginning with an underscore followed by a capital letter or another underscore are reserved for the implementation, so you shouldn't define them in your code.
Prefixing an integer with 0 makes it an octal number instead of decimal, and 010100 in octal is 4160 in decimal.
There is no binary number syntax in C, at least without some compiler extension. What you see is 010100 interpreted as an octal (base 8) number: it is done when a numeric literal begins with 0.
010100 is treated as octal by C because of the leading 0. Octal 10100 is 4160.
Check this out; it has some macros for using binary numbers in C:
http://www.velocityreviews.com/forums/t318127-using-binary-numbers-in-c.html
There is another thread that has this also
Can I use a binary literal in C or C++?
If you are willing to write non-portable code and use gcc, you can use the binary constants extension:
#define _ImmSign 0b010100
Octal :-)
You may find these macros helpful to represent binary numbers with decimal or octal numbers in the form of 1's and 0's. They do handle leading zeros, but unfortunately you have to pick the correct macro name depending on whether you have a leading zero or not. Not perfect, but hopefully helpful.
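For reference, here is a sketch in the spirit of those macros (not copied from the linked thread): it reinterprets the decimal digits of a 0/1 literal as bits. The argument must not have a leading zero, or the compiler will already have read it as octal; that is exactly why the linked macros use separate names for that case. The constant below is renamed ImmSign without the leading underscore, per the reserved-identifier note above.

/* Sketch: treat the 8 decimal digits of a 0/1 literal as bits. */
#define BIN8(d) ( ((d) / 10000000 % 10) << 7 \
                | ((d) /  1000000 % 10) << 6 \
                | ((d) /   100000 % 10) << 5 \
                | ((d) /    10000 % 10) << 4 \
                | ((d) /     1000 % 10) << 3 \
                | ((d) /      100 % 10) << 2 \
                | ((d) /       10 % 10) << 1 \
                | ((d)            % 10) )

#define ImmSign BIN8(10100)   /* 20, i.e. binary 10100; no leading 0! */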

How is 65 translated to the 'A' character?

In ASCII, I wonder how 65 is translated to the 'A' character.
As far as my knowledge goes, 65 can be represented in binary but 'A' cannot. So how could this conversion happen?
Everything in a computer is binary. So a string in C is a sequence of binary values. Obviously that is not much use to humans, so various standards developed, where people decided what numerical values would represent certain letters. In ASCII the value 65 represents the letter A. So the value stored is 65, but everyone knows (because they have read the ASCII spec) that value corresponds to the letter A.
For example, if I am writing the code to display text on the screen, and I receive the value 65, I know to set certain pixels and delete other pixels, so that pixels are arranged like:
  #
 # #
#####
#   #
#   #
At no point does my code "really know" that is an "A". It just knows that 65 is displayed as that pattern. Because, as you say, you cannot store letters directly, only binary numbers.
It is just a 'definition'. ASCII defines the relationships between integer values and characters. For implementation, there is a table (you can't see it) that does this translation.
EDIT:
Computers deal only in 0s and 1s. A stream of characters is just a bunch of 0/1 streams: 0110010101... There is a contract between human and computer: 8 bits represent a character (okay, there are Unicode, UTF-8 and so on). And 'A' is 65, and so on.
In C/C++ and other languages, strings are just handled like integer arrays. Only when you need to display strings are the numbers 'translated' into characters. This translation is done by either hardware or software:
If you write a function that draws characters, you're responsible for drawing 'A' when the input is 65.
In the past, say in DOS, the computer drew 'A' for the number 65. That relationship was usually stored in memory. (At that time, with no graphics, only text, this table could be tweaked to extend the character set. I remember Norton DOS utilities such as NDD/NCD changing this table to draw special characters that were not in the regular ASCII code.)
You see this sort of contract or definition everywhere. Take assembly code, for example. Your program will eventually be translated into machine code, which is also just a bunch of 0s and 1s. But it is extremely hard to understand when only 0s and 1s are shown. So there is a rule: say 101010 means "add", 1100 means "mov". That's why we can write "add eax, 1" and have it ultimately encoded into 0s and 1s.
'A' IS 65. It's just that your display device knows that it should display the value 65 as an A when it renders that value as a character.
The ASCII table is just an agreed upon map of values and characters.
When the computer is instructed to write a character represented by a number to the screen, it just finds the number's corresponding image. The image doesn't mean anything to the computer; it could be an image that looks like an 'A' or a snowman to the user.
So how could this conversion happen?
This conversion is merely called character encoding. The computer only understands bytes, and humans (on average =)) only understand characters. The computer has, roughly said, a mapping between bytes and the characters which belong to those bytes, so that it can present the data in a human-friendly manner. It's all software-based (thus not hardware-based). The operating system is usually the one that takes care of this.
ASCII is one of the oldest character encodings. Nowadays we should all be on UTF-8 to avoid mojibake.
Everything in a computer is stored as a number. It's how software interprets those numbers that's important.
ASCII is a standard that maps the number 65 to the letter 'A'. They could have chosen 66 or 14 to represent 'A', but they didn't. It's almost arbitrary.
So if you have the number 65 sitting in computer memory somewhere, a piece of code that treats that piece of memory as ASCII will map the 65 to 'A'. Another piece of code that treats that memory as an entirely different format may translate it to something else entirely.
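A tiny sketch (my own example) showing that 'A' and 65 really are the same value in C; only the output format decides whether you see a letter or a number:

#include <stdio.h>

int main(void)
{
    char c = 'A';
    printf("%c %d\n", c, c);     /* prints: A 65                  */
    printf("%c\n", 65);          /* prints: A                     */
    printf("%d\n", 'A' == 65);   /* prints: 1 on an ASCII system  */
    return 0;
}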
The code for converting an ASCII value entered by the user to the corresponding character is:
#include <stdio.h>
int main(void) {
    int a;
    printf("enter the ASCII value : ");
    scanf("%d", &a);
    printf("%d is the ASCII of %c\n", a, a);
    return 0;
}
It's based on a lookup table invented back in the 1960s.
