I am writing a program where I need to delete duplicate points stored in a matrix. The problem is that when the program checks whether those points are in the matrix, MATLAB doesn't recognize them even though they exist.
In the following code, the intersections function gets the intersection points:
[points(:,1), points(:,2)] = intersections(...
obj.modifiedVGVertices(1,:), obj.modifiedVGVertices(2,:), ...
[vertex1(1) vertex2(1)], [vertex1(2) vertex2(2)]);
The result:
>> points
points =
12.0000 15.0000
33.0000 24.0000
33.0000 24.0000
>> vertex1
vertex1 =
12
15
>> vertex2
vertex2 =
33
24
Two points (vertex1 and vertex2) should be eliminated from the result. This should be done with the following commands:
points = points((points(:,1) ~= vertex1(1)) | (points(:,2) ~= vertex1(2)), :);
points = points((points(:,1) ~= vertex2(1)) | (points(:,2) ~= vertex2(2)), :);
After doing that, we have this unexpected outcome:
>> points
points =
33.0000 24.0000
The outcome should be an empty matrix. As you can see, the first (or second?) pair of [33.0000 24.0000] has been eliminated, but not the second one.
Then I checked these two expressions:
>> points(1) ~= vertex2(1)
ans =
0
>> points(2) ~= vertex2(2)
ans =
1 % <-- It means 24.0000 is not equal to 24.0000?
What is the problem?
More surprisingly, I made a new script that has only these commands:
points = [12.0000 15.0000
33.0000 24.0000
33.0000 24.0000];
vertex1 = [12 ; 15];
vertex2 = [33 ; 24];
points = points((points(:,1) ~= vertex1(1)) | (points(:,2) ~= vertex1(2)), :);
points = points((points(:,1) ~= vertex2(1)) | (points(:,2) ~= vertex2(2)), :);
The result is as expected:
>> points
points =
Empty matrix: 0-by-2
The problem you're having relates to how floating-point numbers are represented on a computer. A more detailed discussion of floating-point representations appears towards the end of my answer (The "Floating-point representation" section). The TL;DR version: because computers have finite amounts of memory, numbers can only be represented with finite precision. Thus, the accuracy of floating-point numbers is limited to a certain number of decimal places (about 16 significant digits for double-precision values, the default used in MATLAB).
Actual vs. displayed precision
Now to address the specific example in the question... while 24.0000 and 24.0000 are displayed in the same manner, it turns out that they actually differ by very small decimal amounts in this case. You don't see it because MATLAB displays only 4 digits after the decimal point by default (format short), keeping the overall display neat and tidy. If you want to see the full precision, you should either issue the format long command or view a hexadecimal representation of the number:
>> pi
ans =
3.1416
>> format long
>> pi
ans =
3.141592653589793
>> num2hex(pi)
ans =
400921fb54442d18
Initialized values vs. computed values
Since there are only a finite number of values that can be represented for a floating-point number, it's possible for a computation to result in a value that falls between two of these representations. In such a case, the result has to be rounded off to one of them. This introduces a small machine-precision error. This also means that initializing a value directly or by some computation can give slightly different results. For example, the value 0.1 doesn't have an exact floating-point representation (i.e. it gets slightly rounded off), and so you end up with counter-intuitive results like this due to the way round-off errors accumulate:
>> a=sum([0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]); % Sum 10 0.1s
>> b=1; % Initialize to 1
>> a == b
ans =
logical
0 % They are unequal!
>> num2hex(a) % Let's check their hex representation to confirm
ans =
3fefffffffffffff
>> num2hex(b)
ans =
3ff0000000000000
How to correctly handle floating-point comparisons
Since floating-point values can differ by very small amounts, any comparisons should be done by checking that the values are within some range (i.e. tolerance) of one another, as opposed to exactly equal to each other. For example:
a = 24;
b = 24.000001;
tolerance = 0.001;
if abs(a-b) < tolerance, disp('Equal!'); end
will display "Equal!".
You could then change your code to something like:
points = points((abs(points(:,1)-vertex1(1)) > tolerance) | ...
(abs(points(:,2)-vertex1(2)) > tolerance),:)
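Putting both filters together, here is a minimal sketch (the tolerance value is an assumption; pick it to suit the scale of your data):
tolerance = 1e-10;
keep = (abs(points(:,1) - vertex1(1)) > tolerance) | (abs(points(:,2) - vertex1(2)) > tolerance);
points = points(keep, :);
keep = (abs(points(:,1) - vertex2(1)) > tolerance) | (abs(points(:,2) - vertex2(2)) > tolerance);
points = points(keep, :);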
Floating-point representation
A good overview of floating-point numbers (and specifically the IEEE 754 standard for floating-point arithmetic) is What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg.
A binary floating-point number is actually represented by three integers: a sign bit s, a significand (or coefficient/fraction) b, and an exponent e. For double-precision floating-point format, each number is represented by 64 bits laid out in memory as follows: 1 sign bit, followed by 11 exponent bits, followed by 52 significand bits.
The real value of a normalized number can then be found with the following formula:
value = (-1)^s * (1 + b/2^52) * 2^(e-1023)
This format allows for number representations in the range 10^-308 to 10^308. For MATLAB you can get these limits from realmin and realmax:
>> realmin
ans =
2.225073858507201e-308
>> realmax
ans =
1.797693134862316e+308
Since there are a finite number of bits used to represent a floating-point number, there are only so many finite numbers that can be represented within the above given range. Computations will often result in a value that doesn't exactly match one of these finite representations, so the values must be rounded off. These machine-precision errors make themselves evident in different ways, as discussed in the above examples.
In order to better understand these round-off errors it's useful to look at the relative floating-point accuracy provided by the function eps, which quantifies the distance from a given number to the next largest floating-point representation:
>> eps(1)
ans =
2.220446049250313e-16
>> eps(1000)
ans =
1.136868377216160e-13
Notice that the precision is relative to the size of a given number being represented; larger numbers will have larger distances between floating-point representations, and will thus have fewer digits of precision following the decimal point. This can be an important consideration with some calculations. Consider the following example:
>> format long % Display full precision
>> x = rand(1, 10); % Get 10 random values between 0 and 1
>> a = mean(x) % Take the mean
a =
0.587307428244141
>> b = mean(x+10000)-10000 % Take the mean at a different scale, then shift back
b =
0.587307428244458
Note that when we shift the values of x from the range [0 1] to the range [10000 10001], compute a mean, then subtract the mean offset for comparison, we get a value that differs for the last 3 significant digits. This illustrates how an offset or scaling of data can change the accuracy of calculations performed on it, which is something that has to be accounted for with certain problems.
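If you need a comparison whose tolerance scales with the magnitude of the operands, eps can supply it. A minimal sketch (the factor of 10 is an assumed safety margin, not a universal constant):
>> approxEqual = @(a, b) abs(a - b) <= 10*eps(max(abs(a), abs(b)));
>> approxEqual(24, 24 + eps(24))
ans =
logical
1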
Look at this article: The Perils of Floating Point. Though its examples are in FORTRAN, it makes sense for virtually any modern programming language, including MATLAB. Your problem (and its solution) is described in the "Safe Comparisons" section.
Type
format long g
This command will show the FULL value of the numbers. It's likely that they are really something like 24.00000021321 != 24.00000123124
Try writing
0.1 + 0.1 + 0.1 == 0.3
Warning: You might be surprised about the result!
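For reference, here is what MATLAB actually returns (neither 0.1 nor 0.3 is exactly representable in binary, and the rounding errors don't cancel):
>> 0.1 + 0.1 + 0.1 == 0.3
ans =
logical
0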
Maybe the two numbers are really 24.0 and 24.000000001 but you're not seeing all the decimal places.
Check out the Matlab EPS function.
Matlab uses floating-point math with about 16 significant digits of precision (only about 5 are displayed by default).
Problem 1: I have the decimal representation of a rational number. This is the code for generating a binary number.
x(1) = rand();
[num, den] = rat(x);
q = 2^32;
x1 = num / den * q;
b = dec2bin(x1, bits);
s = str2num(b')';
UPDATE: The information about the dyadic map, expressed in code as
y = mod(x*2, 1)
says that if the input x is a binary iterate s, then the output should be binary with the bits shifted to the left by one position. But if I give the input x = 0.1101 or x = 1101 or x = 1 (bit), the output y is still not binary.
The machine understands the input as a decimal number and hence returns a decimal number. How can I use this map to model/represent binary-valued random variables?
Problem 2: (SOLVED BASED ON THE ANSWER)
Secondly, I need to do another operation involving the command
(X(:,i)>=threshold)*(X(:,i)>=threshold)';
where X is a matrix of real valued numbers and the variable
threshold = 0.5
and i is the index for the element. I keep getting this error
Error using *
Both logical inputs must be scalar.
To compute elementwise TIMES, use TIMES (.*) instead.
I tried using the .* but still I keep getting this error. How do I solve these 2 problems?
It would be helpful if code were provided.
Problem 1: I have the decimal representation of a rational number.
Great. So far so good...
This is the code for generating a binary number.
No, this is the code for generating the binary representation of a number. It's the same number that you represented in decimal. I know you think I'm being pedantic, but as far as I can determine, this is the source of your confusion. A number is a number regardless of the representation. Five sheep is five sheep whether you write it in binary, decimal, octal or using the fingers on Hammish's left hand (he's only got 4 left).
Let's change your code slightly.
bits = 32;
r = rand();
[num, den] = rat(r);
q = 2^bits;
x(1) = num / den;
The value stored in x(1) is a rational number. If we type disp(x(1)) in Matlab, it will show us the value of that number in decimal representation. We can change that representation to binary using the dec2bin command:
b(1,:) = dec2bin(round(x(1)*q), bits);
But it's still the same number. (Actually, it's not the same number because we've now limited the precision to bits bits instead of the native 53 bits Matlab generated it with. More on this later.)
But dec2bin returns the value represented in a character string rather than a number. If we want to implement your function and keep down this path of using the binary representation, we could do something like this:
b(1,:) = dec2bin(round(x(1)*q), bits);
for d = 2:bits
b(d,:) = [b(d-1,2:end) '0'];
end
Each left-shift of the binary representation multiplies the value by 2. By ignoring the bit that's now to the left of the binary point, I'm implicitly performing the mod operation. Since we have no additional significant digits to add to the least-significant bit of the value, I just add a zero.
This will work; you get the proper values and can perform whatever operations on them you want. You can represent them as binary or decimal, you can turn them back into fractions, whatever.
But you can achieve the same thing without conversion to a binary representation.
x(1) = num / den;
for d = 2:bits
x(d) = mod(x(d-1)*2, 1);
end
(Note that I left the value in x(1) as a fraction.)
This does exactly the same operation on the exact same numbers. The one difference is that I didn't reduce the precision of the number at the beginning so it uses the full double precision. Now if I want to take these values and represent them as binary, I can still do that (remember to force the value to the integer range first, though).
c = dec2bin(round(x*q), bits);
Here's the result of a test run of both versions:
b =
11110000011101110111110010010001
11100000111011101111100100100010
11000001110111011111001001000100
10000011101110111110010010001000
00000111011101111100100100010000
00001110111011111001001000100000
00011101110111110010010001000000
00111011101111100100100010000000
01110111011111001001000100000000
11101110111110010010001000000000
11011101111100100100010000000000
10111011111001001000100000000000
01110111110010010001000000000000
11101111100100100010000000000000
11011111001001000100000000000000
10111110010010001000000000000000
01111100100100010000000000000000
11111001001000100000000000000000
11110010010001000000000000000000
11100100100010000000000000000000
11001001000100000000000000000000
10010010001000000000000000000000
00100100010000000000000000000000
01001000100000000000000000000000
10010001000000000000000000000000
00100010000000000000000000000000
01000100000000000000000000000000
10001000000000000000000000000000
00010000000000000000000000000000
00100000000000000000000000000000
01000000000000000000000000000000
10000000000000000000000000000000
c =
11110000011101110111110010010001
11100000111011101111100100100001
11000001110111011111001001000010
10000011101110111110010010000101
00000111011101111100100100001001
00001110111011111001001000010011
00011101110111110010010000100101
00111011101111100100100001001010
01110111011111001001000010010100
11101110111110010010000100101000
11011101111100100100001001010001
10111011111001001000010010100001
01110111110010010000100101000011
11101111100100100001001010000101
11011111001001000010010100001010
10111110010010000100101000010101
01111100100100001001010000101010
11111001001000010010100001010100
11110010010000100101000010100111
11100100100001001010000101001111
11001001000010010100001010011101
10010010000100101000010100111010
00100100001001010000101001110100
01001000010010100001010011101000
10010000100101000010100111010000
00100001001010000101001110100000
01000010010100001010011101000000
10000100101000010100111010000000
00001001010000101001110100000000
00010010100001010011101000000000
00100101000010100111010000000000
01001010000101001110100000000000
The two are identical except for the fact that b runs out of precision after 32 bits and c has 53 bits of precision. You can confirm this by running the code above but casting x(1) to a single:
x(1) = single(num / den);
Problem 1: (UPDATED)
This reflects your updates that your goal is a Dyadic Mapping.
Think of Matlab as an environment that abstracts away the notion of binary numbers. It doesn't have built-in support for numerical operations on binary numbers; in fact, it doesn't have a numerical representation of bits at all, only strings of bits. You can put a decimal number through a custom function to make it look binary, but to Matlab it's still a float. If you put x = 0.1101 through y = mod(2*x, 1), it will treat x as a floating-point decimal value.
Problem 2:
I'm not sure what you're trying to do here. The error is caused by trying to matrix-multiply a vector of type logical. Matrix multiplication is only defined for numeric types. A temporary hack would be to add 0.0 to the vectors before multiplying, thus casting the values to double:
((X(:,i)>=threshold)+0.0)*((X(:,i)>=threshold)+0.0)';
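Equivalently, you can make the cast explicit with double (standard MATLAB), which reads a little more clearly:
v = double(X(:,i) >= threshold);  % logical -> double
v * v'                            % the outer product now succeeds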
So I've been attempting to solve the 3SUM problem, and the following is my algorithm:
def findThree(seq, goal):
for x in range(len(seq)-1):
left = x+1
right = len(seq)-1
while (left < right):
tmp = seq[left] + seq[right] + seq[x]
if tmp > goal:
right -= 1
elif tmp < goal:
left += 1
else:
return [seq[left],seq[right],seq[x]]
As you can see, it's a really generic algorithm that solves it in O(n^2).
The problem I've been experiencing is that this algorithm doesn't seem to like working with floating point numbers.
To test that my theory is right, I gave it the following two arrays:
FloatingArr = [89.95, 120.0, 140.0, 179.95, 199.95, 259.95, 259.95, 259.95, 320.0, 320.0]
IntArr = [89, 120, 140, 179, 199, 259, 259, 259, 320, 320]
findThree(FloatingArr, 779.85)  # I want it to return [259.95, 259.95, 259.95]
>> None  # Fail
findThree(IntArr, 777)  # I want it to return [259, 259, 259]
>> [259, 259, 259]  # Success
The algorithm does work, but it doesn't seem to work well with floating point numbers. What can I do to fix this?
For additional information, my first array originally comes from a list of price strings, but in order to do math with them, I had to strip the "$" sign off. My approach to doing so is this:
for x in range(len(prices)):
    prices[x] = float(prices[x][1:])  # strip the "$" sign and cast each price to float
If there is a much better way, please let me know. I feel as if this problem is not really about findThree() but rather about how I modified the original prices array.
Edit: Seeing that it is indeed a floating-point problem, I guess my next question would be: what is the best way to convert a string to an int after I strip off the "$"?
It doesn't work because numbers like 89.95 typically cannot be stored exactly (the base-two representation of 0.95 is a repeating binary fraction).
In general, with dealing with floating-point numbers, instead of comparing for exact equality via ==, you want to check if the numbers are "close enough" to be considered equal; typically done with abs(a - b) < SOME_THRESHOLD. The exact value of SOME_THRESHOLD depends on how accurate you want to be, and typically requires trial and error to get a good value.
In your specific case, because you're working with dollars and cents, you can simply convert to cents by multiplying by 100 and rounding to an integer (via round, because int truncates and would turn e.g. 7.999999 into 7). Then, your set of numbers will just be integers, solving the rounding problem.
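A minimal sketch of that conversion, assuming every entry looks like "$89.95":
prices = ["$89.95", "$120.00", "$259.95"]
cents = [round(float(p[1:]) * 100) for p in prices]  # [8995, 12000, 25995]
# findThree(cents, 77985) can now work entirely in integers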
You can convert your prices from strings to integers instead of converting them to floats. Let's assume that all prices have at most k digits after the decimal point (in the initial string representation). Then 10^k * price is always a whole number, so you can completely get rid of floating-point computations.
Example: if there are at most two digits after the decimal point, $2.10 becomes 210 and $2.2 becomes 220. There is no need to use float even in intermediate computations, because you can shift the decimal point by two positions to the right (appending zeros if necessary) and then convert the string directly to an integer.
Here is an example of convert function:
def convert(price, max_digits):
""" price - a string representation of the price
max_digits - maximum number of digits after a decimal point
among all prices
"""
parts = price[1:].split('.')
if len(parts) == 2 and len(parts[1]) > 0:
return int(parts[0]) * 10 ** max_digits + \
int(parts[1]) * 10 ** (max_digits - len(parts[1]))
else:
return int(parts[0]) * 10 ** max_digits
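For example, with at most two digits after the decimal point:
>>> convert("$2.10", 2)
210
>>> convert("$2.2", 2)
220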
I need to perform a multiplication operation on a fixed-point variable x (unsigned 16-bit integer [U16] type with binary point 6 [BP6]) with a coefficient A, which I know will always be between 0 and 1. Code is being written in C for a 32-bit embedded platform.
I know that if I were to also make this coefficient a U16 BP6, then I would end up with a U32 BP12 from the multiplication. I want to rescale this result back down to U16 BP6, so I just lop off the first 10 bits and the last 6.
However, since the coefficient is limited in precision by the number of fractional bits, and I do not necessarily need the full 10 bits of integer, I was thinking that I could just make the coefficient variable A a U16 BP15 to yield a more precise result.
I have worked out the following example (bear with me):
Let's say that x = 172.0 (decimal) and I want to use a coefficient A = 0.82 (decimal). The ideal decimal result would be 172.0 * 0.82 = 141.04.
In binary, x = 0010101100.000000.
If I am using BP6 for A, the binary representation will be either
A_1 = 0000000000.110100 = 0.8125 or
A_2 = 0000000000.110101 = 0.828125
(depending on whether value is based on floor or ceiling).
Performing the binary multiplication between x and either value of A yields (leaving out leading zeroes):
A_1 * x = 10001011.110000000000 = 139.75
A_2 * x = 10001110.011100000000 = 142.4375
In both cases, trimming down the last 6 bits does not affect the result.
Now, if I expanded A to have BP15, then
A_3 = 0.110100011110110 = 0.82000732421875
and the resulting multiplication yields
A_3 * x = 10001101.000010101001000000000 = 141.041259765625
When trimming the extra 15 fractional bits, the result is
A_3 * x = 10001101.000010 = 141.03125
So it's pretty clear here that expanding the coefficient to have more fractional bits yields a more precise result (at least in my example). Is this something which will hold true in general? Is this good/bad to use in practice? Am I missing or misunderstanding something?
EDIT: I should have said "accuracy" in place of "precision" here. I am looking for a result which is closer to my expected value rather than a result which contains more fractional bits.
Having done similar code, I'd say what you are doing will hold true in general, with the following concerns.
It is very easy to get unexpected overflow when shifting around your binary point. Rigorous testing/analysis and/or overflow detection in the code is recommended. Notable failure: Ariane 5.
You want precision, thus I disagree with "lop off ... last 6". Instead I recommend rounding your results as processing time allows: use the MSBit of the portion to be lopped off to decide whether to adjust the result.
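A minimal sketch of that rounding step in C, assuming x is U16 BP6 and the coefficient a is U16 BP15 restricted to [0, 1) as described in the question (the function name is illustrative):
#include <stdint.h>

/* Multiply a U16 BP6 value by a U16 BP15 coefficient in [0, 1),
   rounding (rather than truncating) back to U16 BP6. */
uint16_t mul_bp6_bp15(uint16_t x, uint16_t a)
{
    uint32_t prod = (uint32_t)x * a;  /* U32 BP21; fits because a < 2^15 */
    prod += 1u << 14;                 /* half an output LSB: the MSBit of the bits being lopped off */
    return (uint16_t)(prod >> 15);    /* rescale BP21 -> BP6 */
}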
Might anyone be familiar with tricks and techniques to coerce the set of valid floating point numbers to be a group under a multiplication-based operation?
That is, given any two floating point numbers ("double a,b"), what sequence of operations, including multiply, will turn this into another valid floating point number? (A valid floating point number is anything 1-normalized, excluding NaN, denormals and -0.0).
To put this rough code:
double a = drand();
while ( forever )
{
double b = drand();
a = GROUP_OPERATION(a,b);
//invariant - a is a valid floating point number
}
Just multiplying by itself doesn't work, because of NaNs. Ideally this would be a straight-line approach (avoiding "if above X, divide by Y" formulations).
If this can't work for all valid floating point numbers, is there a subset for which such an operation is available?
(The model I'm looking for is akin to integer multiplication in C - no matter what two integers get multiplied together, you always get an integer back).
(The model I'm looking for is akin to integer multiplication in C - no matter what two integers get multiplied together, you always get an integer back).
Integers modulo 2^N do not form a group under multiplication - what integer multiplied by 2 gives 1? For the nonzero integers to form a group under multiplication, you have to work modulo a prime number (e.g. Z mod 7: 2*4 = 1, so 2 and 4 are each other's inverses).
For floating point values, simple multiplication or addition saturates to +/- Infinity, and there are no values which are the inverses of infinity, so either the set is not closed, or it lacks invertibility.
If on the other hand you want something similar to integer multiplication modulo a power of 2, then multiplication will do - there are elements without an inverse, so it's not a group, but it is closed - you always get a floating point value back. For subsets of floats which are a true group, see lakshmanaraj's answer.
Floating point numbers are backed by bits. That means you can use integer arithmetic on the integer representation of your floating point values and you will get a group.
Not sure this is very useful, though.
/* You have to find the integer type whose size corresponds to your double */
#include <assert.h>
#include <string.h>

typedef double float_t;
typedef long long int_t;

float_t group_operation(float_t a, float_t b)
{
    int_t ia, ib, c;
    float_t result;
    assert(sizeof(float_t) == sizeof(int_t));
    memcpy(&ia, &a, sizeof ia);  /* reinterpret the bits; a plain pointer cast would break aliasing rules */
    memcpy(&ib, &b, sizeof ib);
    /* multiply as unsigned so overflow wraps instead of being undefined */
    c = (int_t)((unsigned long long)ia * (unsigned long long)ib);
    memcpy(&result, &c, sizeof result);  /* reinterpret the product back as a double */
    return result;
}
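A quick usage sketch under the same assumptions (the printed value is simply whatever double the product bits happen to encode, which may well be Inf or NaN):
#include <stdio.h>

int main(void)
{
    printf("%g\n", group_operation(1.5, 2.5));
    return 0;
}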
Floating point numbers never form a group in the sense you are talking about, because of rounding errors. Consider any of those horrible examples from numerical analysis class, like the fact that 0.1 can't be represented exactly in binary.
But then even computational ints don't form a group in that sense, since they're not closed under multiplication either. (Proof: compute the result of while true do x = x*x. At some point you'll exceed the word size, run out of resources for a BIGNUM, or something.)
update for #UnderAchievementAward:
-- added here so I can get formatting, unlike comments
Since I start with floating point (instead of "real" numbers), can't I avoid any of the 0.1 representational issues? The "x = x*x" problem is why additional operations are needed to keep the result in the valid range.
Okay, but then you're going to run into a situation where there will exist some x, y such that 0 ≤ x, y < max but x*y < 0. Or something equally non-intuitive.
The point is that you can certainly define a collection of operations that will look like a group on a finite representation set, but it's going to do weird things if you try to use it as the normal arithmetic operations.
If the group operation is multiplication, then:
if n is the highest bit, then r1 = 1/2^(n-1) is the smallest value that you can operate on, and the set
[r1, 2*r1, 4*r1, 8*r1, ..., 1] union [-r1, -2*r1, -4*r1, ..., -1] union [0] will be the group that you are expecting.
For integers, [1, 0, -1] is the group.
If the group operation can be anything else, then to form a set of n valid group elements, take
A(r) = cos(2*Pi*r/n) for r = 0 to n-1
and the group operation is
COS(COSINV(A1) + COSINV(A2))
I don't know whether you want this...
Or if you want the INFINITY set as a valid group, then the simple answer is:
GROUP OPERATION = AVG(A1, A2) = (A1+A2)/2
or take some function F which has FINV as its inverse, and use
FINV((F(A1) + F(A2))/2)
Examples of F are log, inverse, square, etc.
double a = drand();
while ( forever )
{
double b = drand();
a = (a+b)/2;
//invariant - a is a valid floating point number
}
Or if you want the INFINITY set in DIGITAL format as a valid group, then:
let L be the lowest float number and H be the highest float number;
then GROUP OPERATION = AVG(A1, A2, L, H) = (A1+A2+L+H)/4.
This operation will always stay within L and H for all positive numbers.
For practical purposes, you can take L as four times the lowest float number and H as the highest float number divided by 4.
double l = DBL_MIN * 4;       /* four times the smallest positive double; needs <float.h> */
double h = DBL_MAX / 4;       /* the largest double divided by 4 */
double a = fabs(drand()) / 4;
while ( forever )
{
    double b = fabs(drand()) / 4;
    a = (a+b+l+h)/4;
    //invariant - a is a valid floating point number
}
This covers a subset of all positive float numbers / 4.
The integers don't form a group under multiplication -- 0 doesn't have
an inverse.