Bignum library, slow prime generator - c

I am developing a bignum library: http://pastebin.com/nFgF3zjW
I implemented the Miller-Rabin algorithm (isprime()), but it is extremely slow, compared to for example OpenSSL's BN_is_prime_fasttest.
I tried profiling and the functions that are executed the most are bn_shr_atomic and bn_cmp.
Any idea how I can make this faster?

The GNU Multiple Precision Arithmetic library implements Miller-Rabin. It's documentation is located here:
http://gmplib.org/manual/Number-Theoretic-Functions.html#Number-Theoretic-Functions
I would suggest examining their implementation for pointers on speeding up your computation. However, arbitrary precision arithmetic is inherently going to be slower than working with numbers that fit in registers.
Edit:
There is also a trade-off between the algorithm used and the quality of the resulting probability. That said, I'm not sure what test OpenSSL uses.

Big guess: If you really want to use your own library, I'd first replace the division algorithm by the long division.
To validate my guess: you have cmp and shr in the inner loop of your divison, are those calls the major contributor in your profile or do they come from somewhere else? In general, when you profile you should first look at higher level functions which are big contributor, changing algorithms there is usually more benefical than tuning low level functions.

Related

Numerical stability of Simplex Algorithm

Edit: Simplex the mathematical optimization algorithm, not to be confused with simplex noise or triangulation.
I'm implementing my own linear programming solver and I would like to do so using 32bit floats. I know Simplex is very sensitive to the precision of the numbers because it performs lots of calculations and if too little precision is used, rounding errors may occur. But still, I would like to implement it using 32bit floats so I can make the instructions 4-wide, that is, so I can use SIMD to perform 4 calculations at a time. I'm aware that I could use doubles and make instructions 2-wide, but 4 is greater than 2 :)
I have had problems with my floating point implementation where the solution was suboptimal or the problem was said to be unfeasible. This happens especially with mixed integer linear programs, which I solve with the branch and bound method.
So my question is: how can I prevent as much as possible having rounding errors resulting in unfeasible, unbounded or suboptimal solutions?
I know one thing I can do is to scale the input values so that they are close to one (http://lpsolve.sourceforge.net/5.5/scaling.htm). Is there something else I can do?
Yes, I tried to implement an algorithm for the Extended Knapsack problem using the Branch and bound method and Greedy Algorithm as a heuristic, is the exact analogue of the simplex running with a pivoting strategy that chooses the largest objective increase.
I had problems with the numerical stabilities of the algorithm too.
Personally, I don't think there is an easy way to eliminate the issues if we keep using the floating-point, but there is a way to detect the instability during the branching process.
I think, via experiment instead of rigorous maths on Stability Analysis, the majority of errors propagate through an integer solution that is extremely close to the constraints of the system.
Given any integer solution, we figure out the slack for that solution, and if the elements of the vector are extremely small, or on the magnitude of 1e-14 to 1e-15, then stop the branching and report instability.

Looking for Ansi C89 arbitrary precision math library

I wrote an Ansi C compiler for a friend's custom 16-bit stack-based CPU several years ago but I never got around to implementing all the data types. Now I would like to finish the job so I'm wondering if there are any math libraries out there that I can use to fill the gaps. I can handle 16-bit integer data types since they are native to the CPU and therefore I have all the math routines (ie. +, -, *, /, %) done for them. However, since his CPU does not handle floating point then I have to implement floats/doubles myself. I also have to implement the 8-bit and 32-bit data types (bother integer and floats/doubles). I'm pretty sure this has been done and redone many times and since I'm not particularly looking forward to recreating the wheel I would appreciate it if someone would point me at a library that can help me out.
Now I was looking at GMP but it seems to be overkill (library must be absolutely huge, not sure my custom compiler would be able to handle it) and it takes numbers in the form of strings which would be wasteful for obvious reasons. For example :
mpz_set_str(x, "7612058254738945", 10);
mpz_set_str(y, "9263591128439081", 10);
mpz_mul(result, x, y);
This seems simple enough, I like the api... but I would rather pass in an array rather than a string. For example, if I wanted to multiply two 32-bit longs together I would like to be able to pass it two arrays of size two where each array contains two 16-bit values that actually represent a 32-bit long and have the library place the output into an output array. If I needed floating point then I should be able to specify the precision as well.
This may seem like asking for too much but I'm asking in the hopes that someone has seen something like this.
Many thanks in advance!
Let's divide the answer.
8-bit arithmetic
This one is very easy. In fact, C already talks about this under the term "integer promotion". This means that if you have 8-bit data and you want to do an operation on them, you simply pad them with zero (or one if signed and negative) to make them 16-bit. Then you proceed with the normal 16-bit operation.
32-bit arithmetic
Note: so long as the standard is concerned, you don't really need to have 32-bit integers.
This could be a bit tricky, but it is still not worth using a library for. For each operation, you would need to take a look at how you learned to do them in elementary school in base 10, and then do the same in base 216 for 2 digit numbers (each digit being one 16-bit integer). Once you understand the analogy with simple base 10 math (and hence the algorithms), you would need to implement them in assembly of your CPU.
This basically means loading the most significant 16 bit on one register, and the least significant in another register. Then follow the algorithm for each operation and perform it. You would most likely need to get help from overflow and other flags.
Floating point arithmetic
Note: so long as the standard is concerned, you don't really need to conform to IEEE 754.
There are various libraries already written for software emulated floating points. You may find this gcc wiki page interesting:
GNU libc has a third implementation, soft-fp. (Variants of this are also used for Linux kernel math emulation on some targets.) soft-fp is used in glibc on PowerPC --without-fp to provide the same soft-float functions as in libgcc. It is also used on Alpha, SPARC and PowerPC to provide some ABI-specified floating-point functions (which in turn may get used by GCC); on PowerPC these are IEEE quad functions, not IBM long double ones.
Performance measurements with EEMBC indicate that soft-fp (as speeded up somewhat using ideas from ieeelib) is about 10-15% faster than fp-bit and ieeelib about 1% faster than soft-fp, testing on IBM PowerPC 405 and 440. These are geometric mean measurements across EEMBC; some tests are several times faster with soft-fp than with fp-bit if they make heavy use of floating point, while others don't make significant use of floating point. Depending on the particular test, either soft-fp or ieeelib may be faster; for example, soft-fp is somewhat faster on Whetstone.
One answer could be to take a look at the source code for glibc and see if you could salvage what you need.

Profiling floating point usage in C

Is there an easy way to count the number of multiplications actually executed by a piece of standard C code? The code I have in mind basically just does additions and multiplications, and it's the multiplications that are of primary interest, but it wouldn't hurt to get counts of the other operations as well.
If it were an option, I suppose I could go around replacing 'a * b' with 'multiply(a, b)' and write a cover function for the native * operator, b/c I really don't care about time performance during this test, but the primary objection to doing that is having to re-work a pile of source code just to run the test.
I have no objection to re-compiling the source, perhaps against some library or with obscure (afaik) options. Valgrind came to mind, but if I understand valgrind's purpose, that's more about tracing values than counting operations.
Compile the source code into assembly language and then search for the multiply instructions.
Note that the optimization level can greatly affect the number that appear. For loops, you would have to determine the scope of multiplies within a loop and factor that into the result, but if the code is fairly constrained or limited in extent, that should be straightforward.
Note: a shameless extrapolation of my comment for as much rep as I can skim.
PAPI has two high-level API functions called PAPI_flips and PAPI_flops which can be used to record the FLOPS as well as the number of floating point operations. Additionally, PAPI offers lots of other performance counter monitoring capability, depending on your processor architecture... cache, bus, memory, branches, etc. I think there is support or support is emerging for graphics accelerators and CUDA/GPGPU.
PAPI will need to be installed on your system, but I think it's widespread enough that installation wouldn't be too painful, if you know what you're doing.
The nice thing about PAPI is that you don't need to know anything about the code; just instrument it (the interface is the same as a stopwatch for FLOPS) and run it. It's based on the actual dynamic execution of your program, so it takes into account things that are hard to account for analytically, such as (pseudo-)random behavior, user/variable input, and related branches.
If your compiler supports soft-float (i.e. using functions with integer implementations to emulate floating-point), you could compiler your program in that mode (-msoft-float in GCC), and use your favorite profiling tool to measure how many times they are invoked.
Many processors also have performance counters that can count the number of floating-point operations that have been retired. Depending on the hardware and OS, you may or may not need some amount of kernel support to take advantage of them.
The best that I can think of is (assuming you're running gdb):
If you could identify the points were multiplications are occurring, you could then set tracepoints just prior to the multiplication (or perhaps just after them depending on the details), then run the program and count the number of tracepoint dumps.
Yes, it is very crude. Certainly there are other solutions; however, I would hesitate to trash my stack for something as simple as a count.

Efficiency of arcsin computation from sine lookup table

I have implemented a lookup table to compute sine/cosine values in my system. I now need inverse trigonometric functions (arcsin/arccos).
My application is running on an embedded device on which I can't add a second lookup table for arcsin as I am limited in program memory. So the solution I had in mind was to browse over the sine lookup table to retrieve the corresponding index.
I am wondering if this solution will be more efficient than using the standard implementation coming from the math standard library.
Has someone already experimented on this?
The current implementation of the LUT is an array of the sine values from 0 to PI/2. The value stored in the table are multiplied by 4096 to stay with integer values with enough precision for my application. The lookup table as a resolution of 1/4096 which give us an array of 6434 values.
Then I have two funcitons sine & cosine that takes an angle in radian multiplied by 4096 as argument. Those functions convert the given angle to the corresponding angle in the first quadrant and read the corresponding value in the table.
My application runs on dsPIC33F at 40 MIPS an I use the C30 compiling suite.
It's pretty hard to say anything with certainty since you have not told us about the hardware, the compiler or your code. However, a priori, I'd expect the standard library from your compiler to be more efficient than your code.
It is perhaps unfortunate that you have to use the C30 compiler which does not support C++, otherwise I'd point you to Optimizing Math-Intensive Applications with Fixed-Point Arithmetic and its associated library.
However the general principles of the CORDIC algorithm apply, and the memory footprint will be far smaller than your current implementation. The article explains the generation of arctan() and the arccos() and arcsin() can be calculated from that as described here.
Of course that suggests also that you will need square-root and division also. These may be expensive though PIC24/dsPIC have hardware integer division. The article on math acceleration deals with square-root also. It is likely that your look-up table approach will be faster for the direct look-up, but perhaps not for the reverse search, but the approaches explained in this article are more general and more precise (the library uses 64bit integers as 36.28 bit fixed point, you might get away with less precision and range in your application), and certainly faster than a standard library implementation using software-floating-point.
You can use a "halfway" approach, combining a coarse-grained lookup table to save memory, and a numeric approximation for the intermediate values (e.g. Maclaurin Series, which will be more accurate than linear interpolation.)
Some examples here.
This question also has some related links.
A binary search of 6434 will take ~12 lookups to find the value, followed by an interpolation if more accuracy is needed. Due to the nature if the sin curve, you will get much more accuracy at one end than the other. If you can spare the memory, making your own inverse table evenly spaced on the inputs is likely a better bet for speed and accuracy.
In terms of comparison to the built-in version, you'll have to test that. When you do, pay attention to how much the size of your image increases. The stdin implementations can be pretty hefty in some systems.

Is it still worth trying to create optimizations for sqrt() in C?

Are the old tricks (lookup table, approx functions) for creating faster implementations of sqrt() still useful, or is the default implementation as fast as it is going to get with modern compilers and hardware?
Rule 1: profile before optimizing
Before investing any effort in the belief that you can beat the optimizer, you must profile everything and discover where the bottleneck really lies. In general, it is unlikely that sqrt() itself is your bottleneck.
Rule 2: replace the algorithm before replacing a standard function
Even if sqrt() is the bottleneck, then it is still reasonably likely that there are algorithmic approaches (such as sorting distances by length squared which is easily computed without a call to any math function) that can eliminate the need to call sqrt() in the first place.
What the compiler does for you if you do nothing else
Many modern C compilers are willing to inline CRT functions at higher optimization levels, making the natural expression including calls to sqrt() as fast as it needs to be.
In particular, I checked MinGW gcc v3.4.5 and it replaced a call to sqrt() with inline code that shuffled the FPU state and at the core used the FSQRT instruction. Thanks to the way that the C standard interacts with IEEE 754 floating point, it did have to follow the FSQRT with some code to check for exceptional conditions and a call to the real sqrt() function from the runtime library so that floating point exceptions can be handled by the library as required by the standard.
With sqrt() inline and used in the context of a larger all double expression, the result is as efficient as possible given the constraints of of standards compliance and preservation of full precision.
For this (very common) combination of compiler and target platform and given no knowledge of the use case, this result is pretty good, and the code is clear and maintainable.
In practice, any tricks will make the code less clear, and likely less maintainable. After all, would you rather maintain (-b + sqrt(b*b - 4.*a*c)) / (2*a) or an opaque block of inline assembly and tables?
Also, in practice, you can generally count on the compiler and library authors to take good advantage of your platform's capabilities, and usually to know more than you do about the subtleties of optimizations.
However, on rare occasions, it is possible to do better.
One such occasion is in calculations where you know how much precision you really need and also know that you aren't depending on the the C standard's floating point exception handling and can get along with what the hardware platform supplies instead.
Edit: I rearranged the text a bit to put emphasis on profiling and algorithms as suggested by Jonathan Leffler in comments. Thanks, Jonathan.
Edit2: Fixed precedence typo in the quadratic example spotted by kmm's sharp eyes.
Sqrt is basically unchanged on most systems. It's a relatively slow operation, but the total system speeds have improved, so it may not be worth trying to use "tricks".
The decision to optimize it with approximations for the (minor) gains this can achieve are really up to you. Modern hardware has eliminated some of the need for these types of sacrifices (speed vs. precision), but in certain situations, this is still valuable.
I'd use profiling to determine whether this is "still useful".
If you have proven that the call to sqrt() in your code is a bottleneck with a profiler then it may be worth trying to create an optimizated version. Otherwise it's a waste of time.
This probably is the fastest method of computing the square root:
float fastsqrt(float val) {
union
{
int tmp;
float val;
} u;
u.val = val;
u.tmp -= 1<<23; /* Remove last bit so 1.0 gives 1.0 */
/* tmp is now an approximation to logbase2(val) */
u.tmp >>= 1; /* divide by 2 */
u.tmp += 1<<29; /* add 64 to exponent: (e+127)/2 =(e/2)+63, */
/* that represents (e/2)-64 but we want e/2 */
return u.val;
}
wikipedia article
This probably is the fastest method of computing the inverse square root. Assume at most 0.00175228 error.
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1);
x = *(float*)&i;
return x*(1.5f - xhalf*x*x);
}
This is (very roughly) about 4 times faster than (float)(1.0/sqrt(x))
wikipedia article
It is generally safe to assume that the standard library developers are quite clever, and have written performant code. You're unlikely to be able to match them in general.
So the question becomes, do you know something that'll let you do a better job? I'm not asking about special algorithms for computing the square root (the standard library developers knows of these too, and if they were worthwhile in general, they'd have used them already), but do you have any specific information about your use case, that changes the situation?
Do you only need limited precision? If so, you can speed it up compared to the standard library version, which has to be accurate.
Or do you know that your application will always run on a specific type of CPU? Then you can look at how efficient that CPU's sqrt instruction is, and see if there are better alternatives. Of course, the downside to this is that if I run your app on another CPU, your code might turn out slower than the standard sqrt().
Can you make assumptions in your code, that the standard library developers couldn't?
You're unlikely to be able to come up with a better solution to the problem "implement an efficient replacement for the standard library sqrt".
But you might be able to come up with a solution to the problem "implement an efficient square root function for this specific situation".
Why not? You probably learn a lot!
I find it very hard to believe that the sqrt function is your application's bottleneck because of the way modern computers are designed. Assuming this isn't a question in reference to some crazy low end processor, you take a tremendous speed hit to access memory outside of your CPU caches, so unless you're algorithm is doing math on a very few numbers (enough that they all basically fit within the L1 and L2 caches) you're not going to notice any speed up from optimizing any of your arithmetic.
I still find it useful even now, though this is the context of normalizing a million+ vectors every frame in response to deforming meshes.
That said, I'm generally not creating my own optimizations but relying on a crude approximation of inverse square root provided as a SIMD instruction: rsqrtps. That is still really useful in speeding up some real-world cases if you're willing to sacrifice precision for speed. Using rsqrtps can actually reduce the entirety of the operation which includes deforming and normalizing vertex normals to almost half the time, but at the cost of the precision of the results (that said, in ways that can barely be noticed by the human eye).
I've also still found the fast inverse sqrt as often credited incorrectly to John Carmack to still improve performance in scalar cases, though I don't use it much nowadays. It's generally natural to get some speed boost if you're willing to sacrifice accuracy. That said, I wouldn't even attempt to beat C's sqrt if you aren't trying to sacrifice precision for speed.
You generally have to sacrifice the generality of the solution (like its precision) if you want to beat standard implementations, and that tends to apply whether it's a mathematical function or, say, malloc. I can easily beat malloc with a narrowly-applicable free list lacking thread-safety that's suitable for very specific contexts. It's another thing to beat it with a general-purpose allocator which can allocate variable-sized chunks of memory and free any one of them at any given time.

Resources