Splines (the piecewise cubic polynomial form) can be written as:
s = x - x[k]
y = y[k] + a[k]*s + b[k]*s*s + c[k]*s*s*s
where x[k] < x < x[k+1], the curve passes through each (x[k], y[k]) point, and a,b,c are arrays of coefficients describing the slope and shape. This all works fine in floating point, and there are plenty of ways to calculate a,b,c for different kinds of splines. However...
How can this be approximated in integer arithmetic?
One of the tricky parts is that any approximation should, ideally, be continuous: using x = x[k+1] and the coefficients from the k-th segment, the result should be y[k+1] except for rounding errors. In other words, for a straight segment, y[k+1] == y[k] + a[k]*(x[k+1] - x[k]), and curvy segments only deviate from this in the middle but not at either end. This is guaranteed by construction in the case of floating point, but even a small coefficient change from rounding can throw it off quite a bit.
Another tricky part is that, in general, the magnitude of the higher-order coefficients is much smaller - but not always, especially not at sharp "corners". It may still make sense to scale them up by the typical size of s to the power of whatever order they are, so they are not rounded off to zero as integers, but that would seem to trade off resolution in curvature against maximum possible corner sharpness.
First try at an integer version:
y = y[k] + (a[k] + (b[k] + c[k]*s)*s)*s
Then use integer multiply (intended for 16bit values, 32bit arithmetic):
#define q (1<<16)
#define mult(x, y) (((x) * (y)) / q)   /* arguments parenthesized so expressions work */
y = y[k] + mult(mult(mult(c[k], s) + b[k], s) + a[k], s)
This looks good in theory, but I'm not sure it's the best possible approach, or how to tell systematically what the best possible approach is.
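For reference, the first try could be packaged as a small routine along these lines (a sketch; the function names, the Q16.16 interpretation, and the 64-bit intermediate product are assumptions on my part, not part of the question):
#include <stdint.h>

#define FIX_SHIFT 16                       /* Q16.16: 16 fractional bits */

static int32_t fix_mul(int32_t x, int32_t y)
{
    /* widen to 64 bits so the intermediate product cannot overflow,
       then divide as in the mult() macro above (truncation toward zero) */
    return (int32_t)(((int64_t)x * y) / (1 << FIX_SHIFT));
}

/* y = y[k] + (a + (b + c*s)*s)*s for one segment, everything in Q16.16 */
static int32_t spline_eval_segment(int32_t yk, int32_t a, int32_t b,
                                   int32_t c, int32_t s)
{
    int32_t acc = fix_mul(c, s) + b;
    acc = fix_mul(acc, s) + a;
    return yk + fix_mul(acc, s);
}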
Related
I am currently doing some 2D geometry in C, mostly lines intersecting each other. Those lines have all kinds of slopes: from 0.001 to 1000 (as examples; I don't even know the real range).
I was using floats until now and did not have to worry about whether the value was very small (in which case floating point would store 0.0011 as about 1e-3 with no rounding) or very large (in which case 1001 would be stored as about 1e3), with little loss of precision in both cases where it matters.
But now I want to try without floats, with integers. How do I maintain precision in my calculations? I could have a flag telling me whether the slope is going to be big or small, and then work with a tenth of big slopes and ten times small slopes, so that rounding is no problem for small slopes and there is no overflow for big slopes. But that feels like a headache.
Basically I still need to be able to differentiate between a slope of 0.2 and 0.4, and, on the overflow side of things, between a slope of 1000 and 2000 (supposing that ints overflowed at 1000; overflow is less of a problem here).
Any other ideas?
Store the slope as a pair of integers
struct slope {
int delta_y;
int delta_x;
};
This allows for a wide range of slopes like 0 and +/- 1/INT_MAX ... +/- INT_MAX, even vertical. With careful coding, exact computations can be had.
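For example, two such slopes can be ordered exactly without any division by cross-multiplying in a wider type (a sketch, assuming both delta_x values are positive):
#include <stdint.h>

/* Returns <0, 0 or >0 according to whether dy1/dx1 is less than, equal to,
   or greater than dy2/dx2.  Assumes dx1 > 0 and dx2 > 0, so multiplying
   both sides by dx1*dx2 preserves the ordering; the 64-bit products cannot
   overflow for 32-bit inputs. */
static int slope_cmp(int dy1, int dx1, int dy2, int dx2)
{
    int64_t lhs = (int64_t)dy1 * dx2;
    int64_t rhs = (int64_t)dy2 * dx1;
    return (lhs > rhs) - (lhs < rhs);
}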
Tardy credit: this is much like @Ignacio Vazquez-Abrams' comment.
Generally speaking, with lines of arbitrary orientation it is not recommended to work with the slope/intercept representation y = mx + p, but with the implicit equation a x + b y + c = 0. The latter is more isotropic, supports vertical lines and gives you extra flexibility to scale the coefficients.
Meeting @chux's answer, the coefficients can be the deltas, Dy x - Dx y + c = 0 (assuming that the lines are defined by two points, Dx and Dy are likely to not overflow). Overflow is still possible on c, and you can use the variant Dy (x - x0) - Dx (y - y0) = 0.
Anyway, intermediate computations such as intersections may require larger ranges, i.e. double length integers.
The idea of flagging the large/small values is a little counterproductive: it is actually a primitive way of doing floating point, i.e. separating the scale from the mantissa. Working this way, you will somehow re-design a floating-point system, less powerful than the built-in type and costing you sweat and tears.
Unfortunately, high range arithmetic can't be avoided. Indeed, the intersection of two straight lines is given by the Cramer formulas
x = (c b' - c' b) / (a b' - a' b),
y = (a c' - a' c) / (a b' - a' b)
where the products to be evaluated are one order of magnitude larger than the initial coefficients. This is explained by the fact that quasi-parallel lines have far-away intersections.
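As a concrete sketch of both points (my illustration, not part of the answer), an intersection routine with 32-bit coefficients and 64-bit intermediates could look like this; it follows the quoted Cramer formulas, i.e. the convention a x + b y = c, and returns the result as an exact rational so that no rounding policy is imposed here:
#include <stdbool.h>
#include <stdint.h>

static bool intersect(int32_t a1, int32_t b1, int32_t c1,
                      int32_t a2, int32_t b2, int32_t c2,
                      int64_t *xn, int64_t *yn, int64_t *d)
{
    *d = (int64_t)a1 * b2 - (int64_t)a2 * b1;    /* a b' - a' b              */
    if (*d == 0)
        return false;                            /* parallel or identical    */
    *xn = (int64_t)c1 * b2 - (int64_t)c2 * b1;   /* c b' - c' b, x = *xn/*d  */
    *yn = (int64_t)a1 * c2 - (int64_t)a2 * c1;   /* a c' - a' c, y = *yn/*d  */
    return true;
}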
Look up fixed point arithmetic if you want to use int in a general way.
You can also design your algorithms so that you do every computation in such a way that you don't need sub-integer accuracy (for example look up Bresenham's line and circle drawing algorithms).
For your particular problem, you could try to keep the numerator and denominator separately, in other words use rational numbers. Or to put it another way, have delta X and delta Y as two numbers.
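To give a flavor of the Bresenham idea mentioned above (a generic sketch, not tailored to the question): an integer error term stands in for the fractional slope, so no sub-integer values are ever stored.
#include <stdlib.h>

/* Bresenham's line algorithm, integer-only.  plot() is a placeholder for
   whatever "visit this cell" means in your application. */
static void bresenham_line(int x0, int y0, int x1, int y1,
                           void (*plot)(int x, int y))
{
    int dx =  abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;                           /* combined error term */

    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }   /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }   /* step in y */
    }
}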
In various contexts, for example for the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive variable argument and K is a constant. In many cases, K is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functions fma() and fmaf().
For ease of exploration, I am experimenting with float arithmetic. Since I plan to port the approach to double arithmetic as well, no operations using higher than the native precision of both argument and result may be used. My best solution so far is:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
m = a - K;                 /* numerator, may be inexact when a is far from K  */
p = a + K;                 /* denominator, may be inexact                     */
r = 1.0f / p;              /* rounded reciprocal of the denominator           */
q = m * r;                 /* initial quotient estimate                       */
t = fmaf (q, -2.0f*K, m);  /* t = m - 2*K*q                                   */
e = fmaf (q, -m, t);       /* e = m - q*(m + 2*K), i.e. the residual m - q*p  */
q = fmaf (r, e, q);        /* apply the correction term e/p ~= e*r            */
For arguments a in the interval [K/2, 4.23*K], code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K is a power of 2, and there is no overflow or underflow in intermediate results. For K not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
I make the following observation when K = 2^n: When the upper bound of the work interval increases to 8*K, 16*K, ... maximum error increases gradually and starts to slowly approximate the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K, the maximum error of the improved method above equals the maximum error of the naive method.
Is there a method to compute q = (a - K) / (a + K) that can achieve smaller maximum error (measured in ulp vs the mathematical result) compared to both the naive method and the above code sequence, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K? Efficiency is important, but a few more operations than are used in the above code can likely be tolerated.
In one answer below, it was pointed out that I could enhance accuracy by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo, i.e. similar to the well-known double-float and double-double formats. In my code above, this would mean changing the last line to qlo = r * e.
This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow(). But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2 (for single precision) or K=4 (for double precision) to keep the primary approximation interval narrow, and the interval for a is roughly [0,28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
If a is large compared to K, then (a-K)/(a+K) = 1 - 2K / (a + K) will give a good approximation. If a is small compared to K, then 2a / (a + K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation, so doing the division will give a decent result.
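Spelled out as code, the case split above might look like this (a direct transcription of the three regimes with K/2 and 2*K as thresholds, not a verified error bound):
/* Pick the formulation whose subtraction does the least damage. */
static float ratio_a_minus_K_over_a_plus_K(float a, float K)
{
    if (a > 2.0f * K)                    /* a large: 1 - 2K/(a+K)        */
        return 1.0f - 2.0f * K / (a + K);
    if (a < 0.5f * K)                    /* a small: 2a/(a+K) - 1        */
        return 2.0f * a / (a + K) - 1.0f;
    return (a - K) / (a + K);            /* K/2 <= a <= 2K: a-K is exact */
}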
One possibility is to track the error of m and p into m1 and p1 with the classical Dekker/Shewchuk approach:
m = a - k;      /* head of a - k                     */
k0 = a - m;
a0 = k0 + m;
k1 = k0 - k;
a1 = a - a0;
m1 = a1 + k1;   /* tail: rounding error of m = a - k */
p = a + k;      /* head of a + k                     */
k0 = p - a;
a0 = p - k0;
k1 = k - k0;
a1 = a - a0;
p1 = a1 + k1;   /* tail: rounding error of p = a + k */
Then, correct the naive division:
q = m / p;              /* naive quotient                          */
r0 = fmaf(p, -q, m);    /* r0 = m - q*p, residual from the heads   */
r1 = fmaf(p1, -q, m1);  /* r1 = m1 - q*p1, residual from the tails */
r = r0 + r1;
q1 = r / p;             /* correction term                         */
q = q + q1;
That'll cost you 2 divisions, but should be near half ulp if I didn't screw up.
But these divisions can be replaced by multiplications with inverse of p without any problem, since the first incorrectly rounded division will be compensated by remainder r, and second incorrectly rounded division does not really matter (the last bits of correction q1 won't change anything).
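For concreteness, that replacement might look as follows (a sketch that reuses m, p, m1 and p1 from the code above; rp is a new name for the rounded reciprocal):
rp = 1.0f / p;          /* one (possibly inexact) reciprocal of p             */
q  = m * rp;            /* first quotient estimate                            */
r0 = fmaf(p, -q, m);    /* residual against the head of the divisor           */
r1 = fmaf(p1, -q, m1);  /* residual contribution of the tails                 */
r  = r0 + r1;
q1 = r * rp;            /* correction; its own rounding error is insignificant */
q  = q + q1;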
I don't really have an answer (proper floating point error analyses are very tedious) but a few observations:
Fast reciprocal instructions (such as RCPSS) are not as accurate as division, so you may see a reduction in accuracy if using these.
m is computed exactly if a ∈ [0.5×Kb, 2^(1+n)×Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e. if K is a power of 2, then n=23).
This is similar to a simplified form of the div2 algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from this (i.e. store m as the sum of 2 floats, or use a double).
Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a, making use of double-float arithmetic for all intermediate computation seems too costly.
Thinking some more about the problem, it is clear that the computation of the remainder of the division, e in the code from my question, is the crucial part of achieving a more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m to represent (a-K) and represented (a+K) as m + 2*K, as this delivers numerically superior results to the straightforward representation.
With relatively small additional computational cost, (a+K) can be represented as a double-float, that is, a head-tail pair p:plo, which leads to the following modified version of my original code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mx = fmaxf (a, K);
mn = fminf (a, K);
plo = (mx - p) + mn;
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
q = fmaf (r, e, q);
Testing shows that this delivers nearly correctly rounded results for a in [K/2, 2^24*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.
Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-float head-tail pair m:mlo, which leads to the following code variant:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
plo = (a < K) ? ((K - p) + a) : ((a - p) + K);
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
e = e + mlo;
q = fmaf (r, e, q);
Exhaustive testing shows that this delivers nearly correctly rounded results for a in the interval [K/2^24, K*2^24). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.
As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p, plo. This approach results in the following code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -2.0f*K, m);
t = fmaf (q, -m, t);
e = fmaf (q - 1.0f, -mlo, t);
q = fmaf (r, e, q);
This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K=2^n nearly correctly rounded results are produced for values of a in the interval [K/2^24, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.
Lastly, one might base the residual computation directly on the original variable a to avoid the error inherent in the computation of m and p. This leads to the following code that, for K = 2^n, computes nearly correctly rounded results for a in the interval [K/2^24, K/3):
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q + 1.0f, -K, a);
e = fmaf (q, -a, t);
q = fmaf (r, e, q);
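The exhaustive testing mentioned above can be reproduced with a small harness along the following lines (my own sketch, not the author's test code). It compares Variant 1 against the correctly rounded single-precision quotient over the interval claimed for it, using a double-precision reference; double's 53-bit significand exceeds the 2*24+2 bits commonly cited as sufficient to make the double rounding of a division harmless:
#include <math.h>
#include <stdint.h>
#include <stdio.h>

static float variant1(float a, float K)   /* Variant 1 from the question */
{
    float m = a - K;
    float p = a + K;
    float r = 1.0f / p;
    float q = m * r;
    float t = fmaf(q, -2.0f * K, m);
    float e = fmaf(q, -m, t);
    return fmaf(r, e, q);
}

int main(void)
{
    const float K = 2.0f;
    uint64_t tested = 0, wrong = 0;
    for (float a = 0.5f * K; a <= 4.23f * K; a = nextafterf(a, INFINITY)) {
        float ref = (float)(((double)a - K) / ((double)a + K)); /* correctly rounded */
        if (variant1(a, K) != ref)
            wrong++;
        tested++;
    }
    printf("%llu mismatches out of %llu values\n",
           (unsigned long long)wrong, (unsigned long long)tested);
    return 0;
}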
If you can relax the API to return another variable that models the error, then the solution becomes much simpler:
float foo(float a, float k, float *res)
{
    float ret = (a - k) / (a + k);
    *res = fmaf(-ret, a + k, a - k) / (a + k);
    return ret;
}
This solution only handles truncation error of division, but does not handle the loss of precision of a+k and a-k.
To handle those errors, I think I need to use double precision, or bithack to use fixed point.
Test code (updated to artificially generate non-zero least-significant bits in the input): https://ideone.com/bHxAg8
The problem is the addition in (a + K). Any loss of precision in (a + K) is magnified by the division. The problem isn't the division itself.
If the exponents of a and K are the same (almost) no precision is lost, and if the absolute difference between the exponents is greater than the significand size then either (a + K) == a (if a has larger magnitude) or (a + K) == K (if K has larger magnitude).
There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only helps widen the "accurate result range" slightly. To understand why, consider smallest + largest (where smallest is the smallest positive denormal a 32-bit floating point number can be). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.) temp = 1/(a + K); result = a * temp - K * temp; won't help much either because you've still got exactly the same (a + K) problem (but it would avoid a similar problem in (a - K)). Also you can't do result = anything / p + anything_error/p_error because division doesn't work like that.
There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of a that can fit in 32-bit floating point. None are likely to be acceptable.
The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of a, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course if the range of possible values of a is smaller than "any positive value that can fit in a 32-bit float" the size of the lookup table would be reduced.
The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
The third alternative involves "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate temp1 = (a + K); if(a < K) temp2 = (a - K); then switch to "round to negative infinity" and calculate if(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;. Next do a_lower = a and decrease a_lower by the smallest amount possible and repeat the "lower_bound" calculation, and keep doing that until you get a different value for lower_bound, then revert back to the previous value of a_lower. After that you do essentially the same (but opposite rounding modes, and incrementing not decrementing) to determine upper_bound and a_upper (starting with the original value of a). Finally, interpolate, like a_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.
Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)
I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n);
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square so that n = y^2 my question would be - is cut=sqrt(n)>=y for all n? If cut=y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay but if n=121 (11^2) and cut is still 10 then it won't work.
My first concern was the fractional part of float only has 23 bits and double 52 so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Let's assume we want the sqrt of some number y but we can't store all the digits of y. If we let the fraction of y we can store be x we can write y = x + dx then we want to make sure that whatever dx we choose does not move us to the next integer.
sqrt(x+dx) < sqrt(x) + 1 //solve
dx < 2*sqrt(x) + 1
// e.g. for x = 100, dx < 21
// sqrt(100+20) < sqrt(100) + 1
Float can store 23 bits so we let y = 2^23 + 2^9. This is more than sufficient since 2^9 < 2*sqrt(2^23) + 1. It's easy to show this for double as well with 64-bit integers. So although they can't store all the digits as long as the sqrt of what they can store is accurate then the sqrt(fraction) should be sufficient. Now let's look at what happens for integers close to INT_MAX and the sqrt:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^32-2 and double can, they get different results for the sqrt. But the float version of the sqrt is one integer larger. This is what I want. For 64-bit integers, as long as the sqrt of the double always rounds up, it's okay.
First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execute slot, it should be entirely hidden by reorder on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < n; j += 2*i+1, i++)
replacing the multiply with an lea or a shift and two adds.
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, and cut*cut is 0x1ffffff8eff971.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and one can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a Taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.
All of my answer is useless if you have access to IEEE 754 double-precision floating point, since Stephen Canon demonstrated both
a simple way to avoid imul in loop
a simple way to compute the ceiling sqrt
Otherwise, if for some reason you have a non-IEEE-754-compliant platform, or only single precision, you could get the integer part of the square root with a simple Newton-Raphson loop. For example, in Squeak Smalltalk we have this method in Integer:
sqrtFloor
"Return the integer part of the square root of self"
| guess delta |
guess := 1 bitShift: (self highBit + 1) // 2.
[
delta := (guess squared - self) // (guess + guess).
delta = 0 ] whileFalse: [
guess := guess - delta ].
^guess - 1
Where // is the operator for the quotient of integer division.
The final guard guess*guess <= self ifTrue: [^guess]. can be avoided if the initial guess is fed in excess of the exact solution, as is the case here.
Initializing with an approximate float sqrt was not an option there because the integers are arbitrarily large and might overflow
But here, you could seed the initial guess with floating point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
#include <math.h>
#include <stdint.h>

uint32_t sqrtFloor(uint64_t n)
{
    int64_t diff;
    int64_t delta;
    uint64_t guess = sqrt(n); /* implicit conversions here... */
    while ((delta = (diff = guess * guess - n) / (guess + guess)) != 0)
        guess -= delta;
    return guess - (diff > 0);
}
That's a few integer multiplications and divisions, but outside the main loop.
What you are looking for is a way to calculate a rational upper bound of the square root of a natural number. A continued fraction is what you need; see Wikipedia.
For x > 0, there is

sqrt(x) = 1 + (x - 1)/(1 + sqrt(x))

To make the notation more compact, the formula can be rewritten by substituting it into itself, giving the continued fraction

sqrt(x) = 1 + (x - 1)/(2 + (x - 1)/(2 + (x - 1)/(2 + ...)))

Truncating the continued fraction by removing the tail (x - 1)/(2 + ...) term at each recursion depth, one gets a sequence of approximations of sqrt(x) as below:

1 + (x - 1)/2
1 + (x - 1)/(2 + (x - 1)/2)
1 + (x - 1)/(2 + (x - 1)/(2 + (x - 1)/2))
...

Upper bounds appear at lines with odd line numbers, and get tighter. When the distance between an upper bound and its neighboring lower bound is less than 1, that approximation is what you need. Using that value as the value of cut (here cut must be a float number) solves the problem.
For very large numbers, rational numbers should be used, so that no precision is lost during conversion between integer and floating-point representations.
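A small self-contained illustration of the iteration (my own sketch, not part of the answer): each convergent p/q of the continued fraction is refined via p/q -> (p + x*q)/(p + q), and the odd-numbered convergents are the upper bounds described above.
#include <stdint.h>
#include <stdio.h>

/* Successive truncations of the continued fraction for sqrt(x), kept as
   exact rationals p/q.  Odd-numbered steps are upper bounds, even-numbered
   steps are lower bounds.  uint64_t is used, so this only works while
   p and q stay below 2^64. */
int main(void)
{
    uint64_t x = 5;                 /* example input                       */
    uint64_t p = 1, q = 1;          /* step 0: the approximation y = 1     */
    for (int step = 1; step <= 6; step++) {
        uint64_t np = p + x * q;    /* numerator of the next convergent    */
        uint64_t nq = p + q;        /* denominator of the next convergent  */
        p = np;
        q = nq;
        printf("step %d: %llu/%llu = %f (%s bound on sqrt(%llu))\n", step,
               (unsigned long long)p, (unsigned long long)q,
               (double)p / (double)q, (step % 2) ? "upper" : "lower",
               (unsigned long long)x);
    }
    return 0;
}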
I want to do point subtraction on an elliptic curve over a prime field. I tried taking the point to be subtracted as (x, -y log(p)), but my answer doesn't seem to match.
This is how I tried to do the subtraction:
s9=point_addition(s6.a,s6.b,((s8.a)%211) ,-((s8.b)%211));
here s9, s6 and s8 are all structures with two int.
and this is my function which does the point addition:
structure point_addition(int x1, int y1, int x2, int y2)
{
    int s, xL, yL;
    if ((x1 - x2) != 0)
    {
        if ((((y1 - y2) / (x1 - x2)) % 211) > 0)
            s = (((y1 - y2) / (x1 - x2)) % 211);
        else
            s = (((y1 - y2) / (x1 - x2)) % 211) + 211;
        if ((((s * s) - (x1 + x2)) % 211) > 0)
            xL = (((s * s) - (x1 + x2)) % 211);
        else
            xL = (((s * s) - (x1 + x2)) % 211) + 211;
        if (((-y1 + s * (x1 - x2)) % 211) > 0)
            yL = ((-y1 + s * (x1 - xL)) % 211);
        else
            yL = ((-y1 + s * (x1 - x2)) % 211) + 211;
    }
    else
    {
        xL = 198;
        yL = 139;
    }
    s7.a = xL;
    s7.b = yL;
    return s7;
}
The program doesn't seem to give me the correct coordinates. Please help me with this code for elliptic curve cryptography.
See GregS's comment about division mod p. You need to find the inverse of the denominator and then multiply. To calculate the modular inverse you could use the extended Euclidean algorithm.
Also the way you're negating the y coordinate then adding 211 later is a bit odd. Best to keep field elements in the proper range when passing as arguments, e.g. to obtain -y mod p, use p-y.
And I assume this is just a learning exercise since you're using a very small field :)
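For reference, a minimal modular inverse via the extended Euclidean algorithm might look like this (a sketch, not part of the answer; it assumes a prime modulus p such as 211 and 0 < a < p):
/* Returns x with (a * x) % p == 1. */
static int mod_inverse(int a, int p)
{
    int old_r = a, r = p;
    int old_s = 1, s = 0;
    while (r != 0) {
        int q = old_r / r;
        int t;
        t = old_r - q * r; old_r = r; r = t;   /* gcd recurrence       */
        t = old_s - q * s; old_s = s; s = t;   /* Bezout coefficient   */
    }
    return ((old_s % p) + p) % p;              /* normalize into [0,p) */
}
The slope then becomes a modular multiplication, e.g. s = ((y1 - y2 + 211) % 211) * mod_inverse((x1 - x2 + 211) % 211, 211) % 211, instead of an integer division (assuming the coordinates are already reduced mod 211).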
I don't understand what you are doing there exactly, what your log(p) is supposed to mean and where your domain parameters enter, but subtracting is easy:
Negate the y-coordinate (-Y = modulus - y) and then plainly add as usual.
If you want a reference for your calculations, you might use my open source software "Academic Signature"
from this link
It is quite transparent with its calculations and produces e.g. results of ECDSA signatures in human-readable hex notation. I am not sure at the moment, though, whether it can do calculations with such short moduli as you are working with.
The manual featuring descriptions on how to program ECC-operations correctly and how to use the software is there:
Link to ecc Manual
Regards
Michael Anders
I am currently tightening floating-point numerics for an estimate of a value. (It's: p(k,t) for those who are interested.) Essentially, the utility can never yield an under-estimate of this value: the security of probable prime generation depends on a numerically robust implementation. While output results agree with the published values, I have used the DBL_EPSILON value to ensure that division, in particular, yields a result that is never less than the true value:
Consider: double x, y; /* assigned some values... */
The evaluation: r = x / y; occurs frequently, but these (finite precision) results may truncate significant digits from the true result - a possibly infinite precision rational expansion. I currently try to mitigate this by applying a bias to the numerator, i.e.,
r = ((1.0 + DBL_EPSILON) * x) / y;
If you know anything about this subject, p(k,t) is typically much smaller than most estimates - but it's simply not good enough to dismiss the issue with this "observation". I can of course state:
(((1.0 + DBL_EPSILON) * x) / y) >= (x / y)
Of course, I need to ensure that the 'biased' result is greater than, or equal to, the 'exact' value. While I am certain it has to do with manipulating or scaling DBL_EPSILON, I obviously want the 'biased' result to exceed the 'exact' result by a minimum - demonstrable under IEEE-754 arithmetic assumptions.
Yes, I've looked though Goldberg's paper, and I've searched for a robust solution. Please don't suggest manipulation of rounding modes. Ideally, I'm after an answer by someone with a very good grasp on floating-point theorems, or knows of a very well illustrated example.
EDIT: To clarify, (((1.0 + DBL_EPSILON) * x) / y) or a form (((1.0 + c) * x) / y), is not a prerequisite. This was simply an approach I was using as 'probably good enough', without having provided a solid basis for it. I can state that the numerator and denominator will not be special values: NaNs, Infs, etc., nor will the denominator be zero.
First: I know that you don't want to set the rounding mode, but it really should be said that
in terms of precision, as others have noted, setting the rounding mode will produce as good of an answer as possible. Specifically, assuming that x and y are both positive (which seems to be the case, but hasn't been explicitly stated in your question), the following is a standard C snippet with the desired effect[1]:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
int OldRoundingMode = fegetround();
fesetround(FE_UPWARD);
r = x/y;
fesetround(OldRoundingMode);
Now, that aside, there are legitimate reasons not to want to change the rounding mode (some platforms don't support round-to-plus-infinity, on some platforms changing the rounding mode introduces a large serializing stall, etc etc), and your desire not to do so shouldn't be brushed aside so casually. So, respecting your question, what else can we do?
If your platform supports fused multiply-add, there's a very elegant solution available to you:
#include <math.h>
r = x/y;
if (fma(r,y,-x) < 0) r = nextafter(r, INFINITY);
On platforms with hardware fma support, this is very efficient. Even if fma( ) is implemented in software, it may be acceptable. This approach has the virtue that it will deliver the same result as would changing the rounding mode; that is, the tightest bound possible.
If your platform's C library is antediluvian and does not provide fma, there is still hope. Your claimed statement is correct (assuming no denormal values, at least -- I would need to think more about what happens for denormals); (1.0+DBL_EPSILON)*x/y really is always greater than or equal to the infinitely precise x/y. It will sometimes be one ulp larger than the smallest value with this property, but that's a very small and probably acceptable margin. The proof of these claims is pretty fussy, and probably not suitable for StackOverflow, but I'll give a quick sketch:
Ignoring denormals, it suffices to restrict ourselves to x, y in [1.0, 2.0).
(1.0 + eps)*x >= x + eps > x. To see this, observe:
(1.0 + eps)*x = x + x*eps >= x + eps > x.
Let P be the mathematically precise x/y. We have:
(1.0 + eps)*x/y >= (x + eps)/y = x/y + eps/y = P + eps/y
Now, y is bounded above by 2, so this gives us:
(1.0 + eps)*x/y > P + eps/2
which is sufficient to guarantee that the result rounds to a value >= P. This also shows us the way to a tighter bound. We could instead use nextafter(x,INFINITY)/y to get the desired effect with a tighter bound in many cases. (nextafter(x,INFINITY) is always x + ulp, whereas (1.0 + eps)*x will be x + 2ulp half of the time. If you want to avoid calling the nextafter library function, you can use (x + (0.75*DBL_EPSILON)*x) instead to get the same result, under the working assumption of positive normal values).
In order to be really pedantically correct, this would become significantly more complicated. No one really writes code like this, but it would be along these lines:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
#if defined FE_UPWARD
int OldRoundingMode = fegetround();
if (OldRoundingMode < 0) goto Error;
if (fesetround(FE_UPWARD)) goto Error;
r = x/y;
if (fesetround(OldRoundingMode)) goto TrulyHosed;
return r;
TrulyHosed:
// we established the desired rounding mode and did our computation,
// but now we can't set it back to the original mode. I have no idea
// how you handle this gracefully.
Error:
#else
// we can't establish the desired rounding mode, so fall back on
// something else.