I'm struggling to understand how truncation works when converting from unsigned to two's complement. Can someone please explain? (My textbook uses the example of truncating a 4-bit value to a 3-bit value, and says that -1 becomes -1, but -5 becomes 3.)
-1 represented on four binary bits is:
1 1 1 1
(-1 is always represented as all bits 1 in 2's complement).
In your textbook “truncating” is simply used to mean(*) “cutting off the highest-order bit(s)”:
1 1 1
The result still has all its bits set, so it still represents -1 — this time, the 3-bit two's complement version of -1.
-5 is represented in 2's complement on 4 bits as:
1 0 1 1
Chopping off the highest-order bit:
0 1 1
We are left with the 3-bit representation of 3. The reason we could not get -5 any more is that -5's magnitude is too large to fit in a 3-bit format.
Numbers with smaller magnitude, that can be represented with 3 bits, are unchanged when the higher-order bits are chopped off. This is the case for numbers from -4 to 3.
(*) Note that usually “truncating” means keeping the most significant bits and removing the least significant ones, especially in the context of floating-point where the bits with weight less than one are erased when converting to integer by “truncation”. The choice of words in the OP's book is very doubtful, unless the book is not in English and words do not map exactly to English when translated.
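To see this in actual code, here is a minimal C sketch of my own (not from the textbook): it masks a value down to 4 bits, chops off the top bit, and sign-extends the remaining 3 bits back to an int. The helper sign_extend is an illustrative name I made up for this example.

#include <stdio.h>

/* Sign-extend the low `bits` bits of v, treated as two's complement. */
static int sign_extend(unsigned v, int bits) {
    unsigned sign = 1u << (bits - 1);
    v &= (1u << bits) - 1;                 /* keep only the low `bits` bits */
    return (int)(v ^ sign) - (int)sign;    /* classic xor-and-subtract trick */
}

int main(void) {
    int values[] = { -1, -5, 3, -4 };
    for (int i = 0; i < 4; i++) {
        unsigned four  = (unsigned)values[i] & 0xFu;  /* 4-bit pattern    */
        unsigned three = four & 0x7u;                 /* drop the top bit */
        printf("%2d -> 4-bit %X -> 3-bit %X = %d\n",
               values[i], four, three, sign_extend(three, 3));
    }
    return 0;
}

Running it shows -1 staying -1, -5 becoming 3, and 3 and -4 surviving unchanged, exactly as described above.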
Related
As far as I know, if I want to represent -1 in binary form, then:
I'll first look for the binary representation of 1 which is 0001.
Then I'll find one's complement (invert all 0's and 1's) to get 1110.
Then I add 1 to the least significant bit and get 1111, which is my answer.
However, I have a doubt: if I represent 1 (in step 1) as 001 (I believe we can do this), then the one's complement would be 110 and adding 1 would yield 111, which is different from what I obtained previously.
How do you explain this difference?
The twos-complement representation of -1 is "all 1s".
However many bit positions your number representation has, set all of them to 1 and that is the two's complement of 1, i.e. -1.
For more details look up "sign extending".
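A quick way to convince yourself of this in C (a small sketch of my own, not part of the original answer): negate 1 the two's-complement way at several widths and check that the result is all 1s at each width.

#include <stdio.h>

int main(void) {
    /* Two's-complement negation of 1: invert, add 1, mask to the width.
       The result is "all 1s" no matter how many bits you keep. */
    for (int bits = 3; bits <= 8; bits++) {
        unsigned mask    = (1u << bits) - 1;
        unsigned neg_one = (~1u + 1u) & mask;
        printf("%d bits: %X %s\n", bits, neg_one,
               neg_one == mask ? "(all bits set)" : "");
    }
    return 0;
}

So 111 and 1111 are both -1; the 4-bit form is just the 3-bit form sign-extended by one bit.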
I'm reading Modern C (version Feb 13, 2018) and on page 42 it says that the bit with index 4 is the least significant bit. Shouldn't the bit with index 0 be the least significant bit? (Same question about the MSB.)
Which is right? What's the correct terminology?
Their definition of "most significant bit" and "least significant bit" is misleading:
8-bit binary number : 1 1 1 1 0 0 0 0
Bit number            7 6 5 4 3 2 1 0
                      |     |       |
                      |     |       least significant bit
                      |     |
                      |     least significant bit that is 1
                      |
                      most significant bit that is 1 (and also just the most significant bit)
The book's definition does not align with common/typical/mainstream/correct usage. See Wikipedia, for instance:
In computing, the least significant bit (LSB) is the bit position in a binary integer giving the units value, that is, determining whether the number is even or odd.
The book, on the other hand, seems to consider only bits that are 1, so that in an 8-bit byte representing the number 16, which we can write:
00010000
the bit that is 1 has index 4 (it's b4 in the book's notation), and then it claims that that particular number's LSB is four.
The proper definition uses LSB to denote the bit whose place value is 1, i.e. the "units" bit, and with that the LSB is always the rightmost bit. This definition is more useful, and I really think the book is wrong.
They're using an unusual definition of LSB and MSB, which only refers to the bits that are set to 1. So in the case of 240, the first 1 bit is b4, not b0, because b0 through b3 are all 0.
I'm not sure why the book considers this definition of LSB/MSB to be useful. It's not generally interesting for integers, although it does come into play in floating point. Floating point numbers are scaled so integers above 1 have the low-order zero bits shifted away, and the exponent is incremented to make up for this (conversely, fractions have their high-order bits shifted away, and the exponent is decremented).
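For what it's worth, here is a small C sketch of my own, using the value 240 from the example above, that contrasts the conventional LSB (always bit 0) with what the book apparently means (the index of the lowest bit that is set):

#include <stdio.h>

int main(void) {
    unsigned x = 240;                 /* 1111 0000 in binary */

    /* Conventional LSB: bit 0, whatever its value happens to be. */
    printf("bit 0 of %u is %u\n", x, x & 1u);

    /* The book's usage: index of the lowest bit that is 1. */
    int i = 0;
    while (((x >> i) & 1u) == 0)
        i++;
    printf("lowest set bit of %u is b%d\n", x, i);   /* prints b4 */
    return 0;
}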
Given the following C code:
int x = atoi(argv[1]);
int y = (x & -x);
if (x==y)
printf("Wow... they are the same!\n");
What values of x will result in "Wow... they are the same!" getting printed? Why?
So, it generally depends, but I will assume that your architecture represents signed numbers in two's complement (U2) format; everything below is false if it's not two's complement. Let's take an example.
We take 3, whose representation will be:
0011
and -3, which will be:
~ 0011
+ 1
-------
1101
and then we AND them together:
1101
& 0011
------
0001
so:
0011 != 0001
That's what is happening under the hood. You have to find the numbers that fit this pattern. I can't say upfront which numbers those are, but by working through examples like this you can predict them.
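If you don't want to work the pattern out by hand, a brute-force scan over a small range (a quick sketch of my own, not part of the original answer) shows which values satisfy the test:

#include <stdio.h>

int main(void) {
    /* Print every x in a small range for which x == (x & -x).
       With two's-complement ints these turn out to be 0 and the
       powers of two. */
    for (int x = -16; x <= 16; x++) {
        if (x == (x & -x))
            printf("%d\n", x);   /* prints 0, 1, 2, 4, 8, 16 */
    }
    return 0;
}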
The question is asking about the binary & operator and 2's complement arithmetic.
I would look at how numbers are represented in 2's complement, and at what the binary & operator does.
Assuming a 2's complement representation for negative numbers, the only values for which this is true are 0 and positive numbers of the form 2^n where n >= 0.
When you take the 2's complement of a number, you flip all the bits and then add one. So the least significant bit will always match. The next bit won't match unless the previous one carried over, and the same goes for each bit after that.
An int is typically 32 bits, however I'll use 5 bits in the following examples for simplicity.
For example, 5 is 00101. Flipping all bits gives us 11010, then adding 1 gives us 11011. Then 00101 & 11011 = 00001. The only bit that matches a set bit is the last one, so 5 doesn't work.
Next we'll try 12, which is 01100. Flipping the bits gives us 10011, then adding 1 gives us 10100. Then 01100 & 10100 = 00100. Because of the carry-over the third bit is set, however the second bit is not, so 12 doesn't work either.
So the most significant bit which is set won't match unless all lower bits carry over when 1 is added. This is true only for numbers with one bit set, i.e. powers of 2.
If we now try 8, which is 01000, flipping the bits gives us 10111 and adding 1 gives us 11000. And 01000 & 11000 = 01000. In this case, the second bit is set, which is the only bit set in the original number. So the condition holds.
Negative numbers cannot satisfy this condition because positive numbers have the most significant bit set to 0, while negative numbers have the most significant bit set to 1. So a bitwise AND of a number and its negative will always have the most significant bit set to 0, meaning this number cannot be negative.
0 is a special case since it is its own negative. 0 & 0 = 0, so it also satisfies this condition.
Another special case is the smallest number you can represent. In the case of a 5-bit number this is -16, which is represented by 10000. Flipping all the bits gives you 01111 and adding 1 gives you 10000, which is the same number. On the surface it seems this number also satisfies the condition, however this is an overflow condition and implementations may not handle this case correctly. See this link for more details.
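To poke at that last case without invoking signed overflow, you can do the inversion-and-add on an unsigned copy of the bits. This is a small sketch of my own, not part of the original answer, and it assumes a 32-bit int for the printed width:

#include <stdio.h>
#include <limits.h>

int main(void) {
    /* INT_MIN is its own two's-complement "negation", but writing -INT_MIN
       on a signed int overflows, which is undefined behaviour in C.
       Unsigned arithmetic lets us look at the bit pattern safely. */
    unsigned x   = (unsigned)INT_MIN;
    unsigned neg = ~x + 1u;                      /* invert and add 1 */
    printf("INT_MIN bits:     %08X\n", x);
    printf("two's complement: %08X\n", neg);     /* identical pattern */
    return 0;
}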
First of all I would like to point out that I am not a native speaker and I really need the terms that are most commonly used.
The second thing I would like to mention is that I am not a math genius. I am really trying to understand everything about programming, but IEEE-754 makes me think that it'll never happen. It's full of mathematical terms I don't understand.
What is precision? What is it used for? What is the mantissa and what is it used for? How do you determine the range of float/double from their size? What is the ± (plus-minus) symbol used for? (I believe it's a positive/negative choice, but what does that have to do with anything?)
Isn't there any brief and clean explanation you guys could provide me with?
I spent 600 years trying to understand Wikipedia. I failed tremendously.
What is precision?
It refers to how closely a binary floating-point representation can approximate a real value. Real values have infinite precision and infinite range; digital values have finite range and precision. In practice, a single-precision IEEE-754 value can represent real values to a precision of 6 significant (decimal) figures, while double precision is good for 15 significant figures.
The practical effect of this is, for example, that a single-precision value 123456000.00 cannot be distinguished from, say, 123456001.00, but equally a value such as 0.00123456 can be represented.
What is it used for?
Precision is not used for anything other than to define a characteristic of a particular floating point representation.
What is mantissa and what is mantissa used for?
The term is not mentioned in the English-language Wikipedia article, and is imprecise — in mathematics in general it has a different meaning than the one used here.
The correct term is significand. For a decimal value 0.00123456, for example, the significand is 123456; 123456000.00 has exactly the same significand but a different exponent. The exponent is a scaling factor which determines where the decimal point is (hence floating point).
Of course IEEE-754 is a binary floating-point representation, not decimal, but for the sake of explaining the terms it is perhaps easier to use decimal.
How to determine the range of float/double by their size?
By the size alone you cannot; you need to know how many bits are assigned to the significand and how many bits are assigned to the exponent. In C, however, the range is defined by the macros FLT_MIN, FLT_MAX, DBL_MIN and DBL_MAX in the float.h header. Other characteristics of the implementation's floating-point representation are described there also.
Note that a specific compiler may not in fact use IEEE754, however that is the format used by most hardware FPU implementations, and the compiler will naturally follow that. For targets with no FPU (small embedded processors typically), other formats may be used.
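For example, a short C program (a sketch of my own) can print those characteristics for the implementation you are actually compiling with:

#include <stdio.h>
#include <float.h>

int main(void) {
    /* Range and precision of float and double as reported by <float.h>. */
    printf("float : min %e  max %e  %d decimal digits\n",
           FLT_MIN, FLT_MAX, FLT_DIG);
    printf("double: min %e  max %e  %d decimal digits\n",
           DBL_MIN, DBL_MAX, DBL_DIG);
    return 0;
}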
What is ± symbol (Plus-minus) used for?
It simply means that the value given may be either positive or negative. It may refer to a specific value, or it may indicate a range. So ±n may refer to two discrete values -n or +n, or it may mean the range -n to +n. Context is everything! In this article it refers to the discrete values +0, -0, +∞ and -∞.
There are 3 different components: sign, exponent, mantissa
Assuming that the exponent has only 2 bits, 4 combinations are possible:
binary decimal
00 0
01 1
10 2
11 3
The represented floating-point value is 2^exponent:
binary exponent-value
00 2^0 = 1
01 2^1 = 2
10 2^2 = 4
11 2^3 = 8
The range of the floating-point value results from the exponent: with 2 exponent bits the maximum value is 2^3 = 8.
The mantissa subdivides the range from a given exponent value to the next higher one.
For example, if the exponent value is 2 and the mantissa has one bit, then two values are possible:
exponent-value mantissa-binary represented floating-point value
2 0 2
2 1 3
The represented floating-point value is 2^exponent × (1 + m1×2^-1 + m2×2^-2 + m3×2^-3 + …).
Here is an example with a 3-bit mantissa:
exponent-value   mantissa-binary   represented floating-point value
2                000               2 * (1)               = 2
2                001               2 * (1 + 2^-3)        = 2.25
2                010               2 * (1 + 2^-2)        = 2.5
2                011               2 * (1 + 2^-2 + 2^-3) = 2.75
2                100               2 * (1 + 2^-1)        = 3
and so on…
The sign is just one bit:
0 -> positive value
1 -> negative value
In IEEE-754 a 32-bit floating-point data type has an 8-bit exponent (with a range from 2^-127 to 2^128) and a 23-bit mantissa.
1 10000010 01101000000000000000000
-    130        1.40625
The represented floating-point value for this is:
-1 × 2^(130 - 127) × (1 + 2^-2 + 2^-3 + 2^-5) = -11.25
Try it: http://www.h-schmidt.net/FloatConverter/IEEE754.html
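If you prefer to see the fields programmatically, here is a small C sketch of my own (assuming float is 32-bit IEEE-754 single precision, which is what the converter above uses) that pulls the sign, exponent and mantissa out of the example value:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* Decode the sign, exponent and mantissa fields of a float.
       Assumes float is 32-bit IEEE-754 single precision. */
    float f = -11.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFFu;     /* biased exponent  */
    unsigned mantissa = bits & 0x7FFFFFu;         /* 23 fraction bits */

    printf("sign=%u  exponent=%u (unbiased %d)  mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* prints: sign=1  exponent=130 (unbiased 3)  mantissa=0x340000 */
    return 0;
}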
Why do we always divide our RGB values by 255? I know that the range is [0-1], but why divide by exactly 255? Can anyone please explain the concept of RGB values to me?
RGB (red, green, blue) values are 8 bits each.
The range for each individual colour is 0-255 (as 2^8 = 256 possibilities).
The combination range is 256*256*256.
By dividing by 255, the 0-255 range can be described with a 0.0-1.0 range where 0.0 means 0 (0x00) and 1.0 means 255 (0xFF).
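As a concrete illustration (a minimal sketch of my own, not part of the original answer), dividing a few 8-bit channel values by 255.0 gives their normalized equivalents:

#include <stdio.h>

int main(void) {
    /* Normalize 8-bit channel values to the 0.0-1.0 range. */
    int channels[3] = { 0, 128, 255 };
    for (int i = 0; i < 3; i++)
        printf("%3d / 255 = %f\n", channels[i], channels[i] / 255.0f);
    /* prints 0.000000, 0.501961, 1.000000 */
    return 0;
}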
This is a bit of a generic question since it can be specific to the platform and even to the method. It really comes down to math and getting a value between 0-1. Since 255 is the maximum value, dividing by 255 expresses a 0-1 representation.
Each channel (red, green, and blue are each channels) is 8 bits, so each has 256 possible values; the maximum is 255 since 0 is included. As the reference shows, systems typically use values between 0-1 when working with floating-point values.
http://en.wikipedia.org/wiki/RGB_color_model
See Numeric Representations.
These ranges may be quantified in several different ways:
- From 0 to 1, with any fractional value in between. This representation is used in theoretical analyses, and in systems that use floating point representations.
- Each color component value can also be written as a percentage, from 0% to 100%.
- In computers, the component values are often stored as integer numbers in the range 0 to 255, the range that a single 8-bit byte can offer. These are often represented as either decimal or hexadecimal numbers.
- High-end digital image equipment is often able to deal with larger integer ranges for each primary color, such as 0..1023 (10 bits), 0..65535 (16 bits) or even larger, by extending the 24 bits (three 8-bit values) to 32-bit, 48-bit, or 64-bit units (more or less independent of the particular computer's word size).
RGB values are usually stored as integers to save memory. But doing math on colors is usually done in float because it's easier, more powerful, and more precise. The act of converting floats to integers is called "Quantization", and it throws away precision.
Typically, RGB values are encoded as 8-bit integers, which range from 0 to 255. It's an industry standard to think of 0.0f as black and 1.0f as white (max brightness). To convert [0, 255] to [0.0f, 1.0f] all you have to do is divide by 255.0f.
If you care, this is the formula to convert back to integer: (int)floor(x * 255.0f + 0.5f). But first clamp x to [0.0f, 1.0f] if necessary.
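Put together, clamping and quantizing might look like the following sketch (my own wrapper around that formula; to_byte is just an illustrative name):

#include <stdio.h>
#include <math.h>

/* Clamp x to [0.0f, 1.0f], then quantize it to an 8-bit channel value. */
static int to_byte(float x) {
    if (x < 0.0f) x = 0.0f;
    if (x > 1.0f) x = 1.0f;
    return (int)floorf(x * 255.0f + 0.5f);
}

int main(void) {
    float samples[] = { 0.0f, 0.5f, 1.0f, 1.2f, -0.1f };
    for (int i = 0; i < 5; i++)
        printf("%5.2f -> %d\n", samples[i], to_byte(samples[i]));
    /* prints 0, 128, 255, 255, 0 */
    return 0;
}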
The RGB value goes from 0 to 255 because it takes up exactly one byte of data. One byte is 8 bits, and each bit is either 0 or 1. That makes 0 in 8-bit binary 00000000, and 255 is 11111111. The last (rightmost) bit says whether there is a 1 in the value, the second-to-last whether there is a 2, the third-to-last whether there is a 4, and so on, doubling every time. If you add up the place values of the bits that are set, you get the total value. For example,
=10110101
=1*128 + 0*64 + 1*32 + 1*16 + 0*8 + 1*4 + 0*2 + 1*1
=128 + 32 + 16 + 4 + 1
=181
This means that 10110101 in binary equals 181 in decimal form.
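The same place-value addition can be spelled out in a few lines of C (a sketch of my own, not from the original answer):

#include <stdio.h>

int main(void) {
    /* Add up the place values of the bits that are set in 10110101. */
    unsigned char byte = 0xB5;        /* 1011 0101 */
    int total = 0;
    for (int i = 7; i >= 0; i--) {
        int bit = (byte >> i) & 1;
        if (bit)
            total += 1 << i;          /* this bit contributes 2^i */
        printf("%d", bit);
    }
    printf(" = %d\n", total);         /* prints 10110101 = 181 */
    return 0;
}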
Given that each octet is nowadays made of 8 bits (binary digits),
suppose we have an octet filled like this:
1 0 1 0 0 1 0 1
for each bit you get 2 possibilities : 0 or 1
2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 = 2^8 = 256
total : 256
And for hexadecimal colors :
given that you have 3 pairs of characters, the leading # excluded => e.g. #00ff00
0, 1, 2, 3, 4 , 5, 6, 7, 8, 9, a, b, c, d, e, f = 16 possibilities
16 x 16 = 256
R G B = color
256 x 256 x 256 = 16,777,216 colors
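A hex color is just those three octets packed side by side; a small C sketch of my own splits one back into its channels:

#include <stdio.h>

int main(void) {
    /* Split a hex color such as #00ff00 into its three 8-bit channels. */
    unsigned color = 0x00FF00;               /* green */
    unsigned r = (color >> 16) & 0xFFu;
    unsigned g = (color >> 8)  & 0xFFu;
    unsigned b =  color        & 0xFFu;
    printf("#%06X -> R=%u G=%u B=%u\n", color, r, g, b);
    return 0;
}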
It makes the vector operations simpler... Imagine you have an image and you want to change its color to red. With vectors you can just take every pixel and multiply it by (1.0, 0.0, 0.0):
P * (1.0, 0.0, 0.0)
Otherwise it just adds unnecessary steps (in this case dividing it by 255)
P * (255, 0, 0) / 255
And imagine using more complex filters; the unnecessary steps would stack up...
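As a rough sketch of what that looks like in C (the struct and the names pixel and tint are mine, purely for illustration), with normalized channels the filter is a plain component-wise multiply:

#include <stdio.h>

/* A pixel with channels already normalized to [0.0, 1.0]. */
struct pixel { float r, g, b; };

/* Component-wise multiply: apply a tint/filter color to a pixel. */
static struct pixel tint(struct pixel p, struct pixel f) {
    struct pixel out = { p.r * f.r, p.g * f.g, p.b * f.b };
    return out;
}

int main(void) {
    struct pixel p   = { 0.8f, 0.5f, 0.2f };
    struct pixel red = { 1.0f, 0.0f, 0.0f };     /* keep only the red channel */
    struct pixel out = tint(p, red);
    printf("%.2f %.2f %.2f\n", out.r, out.g, out.b);   /* 0.80 0.00 0.00 */
    return 0;
}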