Arbitrary precision integer on GPU - bigint

I'm currently doing a primality test on huge numbers (up to 10M digits).
Right now, I'm using a C program based on the GMP library. I did some parallelization using OpenMP and got a nice speedup (~3.5× with 4 cores). The problem is that I don't have enough CPU cores to make running my whole dataset feasible.
I have an NVIDIA GPU, so I tried to find an alternative to GMP, but for GPUs. It can be either CUDA or OpenCL.
Is there an arbitrary precision library that I can run on my GPU? I'm also open to using another programming language if there is a simple or more elegant way of doing it.
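(For reference, the CPU-side setup described above might look roughly like the sketch below; the candidate numbers and Miller-Rabin round count are placeholders, and mpz_probab_prime_p is assumed as the primality test. Compile with -fopenmp and link with -lgmp.)

```c
#include <gmp.h>
#include <stdio.h>

int main(void) {
    /* Placeholder candidates; the real dataset would be far larger. */
    const char *candidates[] = {
        "170141183460469231731687303715884105727",
        "170141183460469231731687303715884105725"
    };
    int n = 2;
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        mpz_t x;
        mpz_init_set_str(x, candidates[i], 10);
        /* 25 Miller-Rabin rounds; returns 2 = prime,
           1 = probably prime, 0 = composite */
        int r = mpz_probab_prime_p(x, 25);
        #pragma omp critical
        printf("candidate %d: %s\n", i, r ? "probably prime" : "composite");
        mpz_clear(x);
    }
    return 0;
}
```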

It seems the Julia Language is already able to do multiprecision arithmetic and use the GPU (see here for a simple example combining these two), but you might have to learn Julia and rewrite your code.
The CUMP library is meant to be a GMP substitute for CUDA. It attempts to make porting GMP code to CUDA easier by offering an interface similar to GMP's: for example, you replace mpf_... variables and functions with cumpf_... ones. There is a demo you can build if it fits your CUDA setup. No documentation or support, though; you'll have to go through the code to see if it works.
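For a feel of the porting effort, here is ordinary GMP mpf_ code; per the naming rule above, a CUMP port would substitute cumpf_ names (those names are inferred from that rule, not from verified CUMP documentation):

```c
#include <gmp.h>

/* Plain GMP floating-point code. Per CUMP's stated convention, a port
   would swap cumpf_ names in for the mpf_ ones (e.g. cumpf_init);
   those names are assumed from the naming rule, not verified. */
int main(void) {
    mpf_t a, b, c;
    mpf_set_default_prec(4096);      /* working precision in bits */
    mpf_init_set_ui(a, 123456789u);
    mpf_init_set_ui(b, 987654321u);
    mpf_init(c);
    mpf_mul(c, a, b);                /* c = a * b */
    gmp_printf("%.Ff\n", c);
    mpf_clears(a, b, c, NULL);
    return 0;
}
```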
The CAMPARY library from people at LAAS-CNRS might be worth a shot as well, but there is no documentation either. It has been applied in more research than CUMP, so there is some chance there. An answer here gives some clarification on how to use it.
GRNS uses the residue number system on CUDA-compatible processors. It has no documentation, but an interesting paper is available. Also, see this one.
XMP comes directly from NVIDIA labs, but it seems incomplete and also has no docs. Some info here and a paper here.
XMP 2.0 seems newer, but it only supports sizes up to 32k bits for now.
GPUMP looks promising, but it seems not to be available for download; perhaps try contacting the authors.
The MPRES-BLAS is a multiprecision library for linear algebra on CUDA, and it does, of course, have code for basic arithmetic. Also, papers here and here.
I haven't tested any of those, so please let us know if any of them work well.

Related

Is there any way of doing multiprecision arithmetic (with integers that are greater than 64 bits) on the MSP430?

I'd like to know if there is any way, ideally a simple one, to do arithmetic with integers that are larger than 64 bits on the MSP430.
I'm asking this specifically because I'm trying to implement encryption algorithms (RSA, AES, hash functions, digital signatures, etc.) on an msp430g2553 platform.
I searched through the internet, and out of misguided desperation I installed Linux distributions in order to use GMP, but failed miserably. I installed Kali and later Lubuntu on a USB 2.0 stick, only to suffer through unbearable freezes, with no clue whether it would work or not. Later on I tried the magic of VirtualBox and things got much easier after that, though the results were inconclusive. I eventually got to a point with msp430-gcc and mspdebug where I could debug some example code and see it work, but I was still unable to do GMP operations, due mostly to library errors (undefined reference to mpz_init, etc.).
To my understanding, GMP is a multiprecision arithmetic library that targets specific processor architectures, and MSP430 is not one of them, though at this point I'd not be surprised if it were. The best answer I got was that even a TI employee was unfamiliar with it. So:
Is it possible to use GMP on the MSP430, or more specifically on the msp430g2553?
I've hardly seen anything on Google that cross-references the msp430 with GMP, and I'm at the point where I'm trying to implement a miserable 64-bit-key RSA that's barely working, if at all. So I hope this post and its answers help someone, hopefully including me, later on.
Also, I forgot to mention: I've read about the RELIC toolkit (but didn't spend time trying to use it, as GMP looked more like the standard in this area), and I'd like to know:
Is there a RELIC guide for dummies that you can link, and again, is it possible to work with it on the MSP430?
Thanks everyone.
It is unlikely that any of these libraries can be compiled for an embedded 16-bit architecture.
The MSP430 CPU has add-with-carry and similar instructions, and that is how the compiler implements 32- and 64-bit integers.
So it would be possible, in theory, to write these algorithms yourself, with lots of (inline) assembly.
But I doubt that the G2553 has enough memory for this.
(There is a reason that some larger MSP430s have an AES hardware accelerator, and that none have one for RSA.)
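For illustration, here is what one limb-level primitive of such a hand-written routine looks like in portable C; the word count and 128-bit operand size are arbitrary. On the MSP430 the carry chain maps directly onto the ADDC instruction mentioned above.

```c
#include <stdint.h>

#define WORDS 8  /* 8 x 16-bit words = 128 bits; illustrative only */

/* r = a + b, little-endian word order; returns the final carry.
   Each iteration is what one ADDC does in hand-written assembly. */
uint16_t add_words(uint16_t r[WORDS],
                   const uint16_t a[WORDS],
                   const uint16_t b[WORDS])
{
    uint16_t carry = 0;
    for (int i = 0; i < WORDS; i++) {
        uint32_t s = (uint32_t)a[i] + b[i] + carry;
        r[i]  = (uint16_t)s;          /* low 16 bits of the sum */
        carry = (uint16_t)(s >> 16);  /* carry out: 0 or 1 */
    }
    return carry;
}
```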

What is the relation between BLAS, LAPACK and ATLAS

I don't understand how BLAS, LAPACK and ATLAS are related, or how I should use them together! I have been looking through all of their manuals, and I have a general idea of BLAS and LAPACK and how to use them from the very few examples I've found, but I can't find any actual examples of using ATLAS to see how it relates to the other two.
I am trying to do some low-level work on matrices, and my primary language is C. First I wanted to use GSL, but it says that if you want the best performance you should use BLAS and ATLAS. Is there any good web page giving some nice examples of how to use all of these together (in C)? In other words, I am looking for a tutorial on using these three (or any subset of them!). In short, I am confused!
BLAS is a collection of low-level matrix and vector arithmetic operations (“multiply a vector by a scalar”, “multiply two matrices and add to a third matrix”, etc.).
LAPACK is a collection of higher-level linear algebra operations. Things like matrix factorizations (LU, LLt, QR, SVD, Schur, etc.) that are used to do things like “find the eigenvalues of a matrix”, or “find the singular values of a matrix”, or “solve a linear system”. LAPACK is built on top of the BLAS; many users of LAPACK only use the LAPACK interfaces and never need to be aware of the BLAS at all. LAPACK is generally compiled separately from the BLAS, and can use whatever highly-optimized BLAS implementation you have available.
ATLAS is a portable, reasonably good implementation of the BLAS interfaces that also implements a few of the most commonly used LAPACK operations.
What “you should use” depends somewhat on details of what you’re trying to do and what platform you’re using. You won't go too far wrong with “use ATLAS + LAPACK”, however.
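To make “use ATLAS + LAPACK” concrete, here is a minimal sketch of calling the BLAS from C through the CBLAS interface; the link flags (e.g. -lcblas -latlas) vary by distribution.

```c
#include <cblas.h>   /* provided by ATLAS, OpenBLAS, etc. */
#include <stdio.h>

/* Compute C = alpha*A*B + beta*C with dgemm, row-major layout. */
int main(void) {
    double A[4] = {1, 2, 3, 4};     /* 2x2 matrix [[1,2],[3,4]] */
    double B[4] = {5, 6, 7, 8};     /* 2x2 matrix [[5,6],[7,8]] */
    double C[4] = {0, 0, 0, 0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,            /* M, N, K */
                1.0, A, 2, B, 2,    /* alpha, A, lda, B, ldb */
                0.0, C, 2);         /* beta, C, ldc */
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* 19 22 / 43 50 */
    return 0;
}
```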
A while ago, when I started doing some linear algebra in C, it came as a surprise to me that there are so few tutorials for BLAS, LAPACK and other fundamental APIs, despite the fact that they are the cornerstones of many other libraries. For that reason I started collecting all the examples/tutorials I could find all over the internet for BLAS, CBLAS, LAPACK, CLAPACK, LAPACKE, ATLAS, OpenBLAS ... in this GitHub repo.
Well, I should warn you that, as a mechanical engineer, I have little experience managing such a git repository, or GitHub. It will at first seem like a complete mess to you guys. However, if you manage to get past the messy structure, you will find all kinds of examples and instructions which might be of help. I have tried most of them to be sure they compile, and I have flagged the ones which do not. I have modified many of them to be compilable with the GNU compilers (gcc, g++ and gfortran). I have made makefiles which you can read to learn how to call individual Fortran/FORTRAN routines in a C or C++ program. I have also put up some installation instructions for macOS and Linux (sorry, Windows folks!), and I have made some bash .sh files for automatic compilation of some of these libraries.
But on to your other question: BLAS and LAPACK are APIs rather than specific SDKs. They are just a list of specifications, or language extensions, rather than implementations or libraries. That said, there are the original implementations by Netlib in FORTRAN 77, which most people refer to (confusingly!) when talking about BLAS and LAPACK. So if you see a lot of strange things when using these APIs, it is because you are actually calling FORTRAN routines from C rather than C libraries and functions. ATLAS and OpenBLAS are some of the best implementations of BLAS and LAPACK as far as I know. They conform to the original API, even though, to my knowledge, they are implemented in C/C++ from scratch (not sure!). There are GPGPU implementations of the APIs using OpenCL: CLBlast, clBLAS, clMAGMA, ArrayFire and ViennaCL, to mention some. There are also vendor-specific implementations optimized for particular hardware or platforms, which I strongly discourage anybody from using.
My recommendation to anyone who wants to use BLAS and LAPACK from C is to learn FORTRAN-C mixed programming first. The first chapter of the aforementioned repo is dedicated to this matter, and there I have collected many different examples.
P.S. I have been working on the dev branch of the repository from time to time. It seems slightly less messy!
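As a hedged illustration of the FORTRAN-C mixed calling mentioned above: the sketch below calls the reference Fortran DGEMM directly from C. The trailing-underscore symbol name and pass-by-pointer convention follow gfortran's common defaults and may differ on other toolchains; note the column-major storage Fortran expects.

```c
#include <stdio.h>

/* Declaration of the Fortran routine as seen from C (gfortran naming). */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void) {
    /* Column-major 2x2 matrices, as Fortran expects. */
    double A[4] = {1, 3, 2, 4};   /* [[1,2],[3,4]] */
    double B[4] = {5, 7, 6, 8};   /* [[5,6],[7,8]] */
    double C[4] = {0};
    int n = 2;
    double one = 1.0, zero = 0.0;
    /* Everything is passed by pointer, including scalars. */
    dgemm_("N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
    printf("%g %g\n%g %g\n", C[0], C[2], C[1], C[3]);  /* 19 22 / 43 50 */
    return 0;
}
```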
ATLAS is by now quite outdated. It was developed at a time when it was thought that optimizing the BLAS for various platforms was beyond the ability of humans, and that, as a result, autogeneration and autotuning were the way to go.
In the early 2000s, along came Kazushige Goto, who showed how highly efficient implementations can be coded by hand. You may enjoy an interesting article in the New York Times: https://www.nytimes.com/2005/11/28/technology/writing-the-fastest-code-by-hand-for-fun-a-human-computer-keeps.html.
Kazushige, on the one hand, had better insight into the theory behind high-performance implementations of matrix-matrix multiplication, and, on the other hand, engineered these better. His approach, which on current CPUs is usually the highest-performing, is not in the search space that ATLAS autotunes over. Hence, ATLAS is inherently inferior. Kazushige's implementation of the BLAS became known as the GotoBLAS. It was forked as OpenBLAS when he joined industry.
The ideas behind the GotoBLAS were refactored into a new implementation, the BLAS-like Library Instantiation Software (BLIS) framework (https://github.com/flame/blis), which implements the same algorithms, but structures the code so that less needs to be custom implemented for a new architecture. BLIS is coded in C.
What this discussion shows is that there are many implementations of the BLAS. The BLAS themselves are a de facto standard for the interface. ATLAS was once the state of the art. It no longer is.
As far as I'm aware, and after working through the ATLAS repository, it seems that it includes a re-implementation of BLAS in C. There's a bit more to it than that, but I hope this answers the question.

pyCUDA vs C performance differences?

I'm new to CUDA programming, and I was wondering how the performance of pyCUDA compares to that of programs implemented in plain C.
Will the performance be roughly the same? Are there any bottlenecks that I should be aware of?
EDIT:
I obviously tried to google this issue first, and was surprised to not find any information; that is, I would have expected the pyCUDA people to have this question answered in their FAQ.
If you're using CUDA, whether directly through C or with pyCUDA, all the heavy numerical work you're doing is done in kernels that execute on the GPU and are written in CUDA C (directly by you, or indirectly with elementwise kernels). So there should be no real difference in performance in those parts of your code.
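To make that concrete, a kernel like the sketch below is the same CUDA C source either way: pyCUDA would receive it as a string via SourceModule, while a C program would compile it with nvcc directly. Only the host-side launch code differs.

```c
/* Identical whether compiled by nvcc in a C program or handed to
   pyCUDA's SourceModule as a string; the heavy lifting runs on the
   GPU in both cases. */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;   /* scale each element in place */
}
```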
Now, the initialization of arrays, and any post-work analysis, will be done in Python (probably with numpy) if you use pyCUDA, and that will generally be significantly slower than doing it directly in a compiled language (though if you've built your numpy/scipy so that it links directly to high-performance libraries, those calls at least would perform the same in either language). But hopefully your initialization and finalization are small fractions of the total work you have to do, so that even if there is significant overhead there, it still hopefully won't have a huge impact on overall runtime.
And in fact, if it turns out that the Python parts of the computation do hurt your application's performance, starting out doing your development in pyCUDA may still be an excellent way to begin, as the development is significantly easier, and you can always re-implement the parts of the code that are too slow in Python in straight C and call those from Python, gaining some of the best of both worlds.
If you're wondering about performance differences from using pyCUDA in different ways, see SimpleSpeedTest.py, included in the pyCUDA Wiki examples. It benchmarks the same task completed by a CUDA C kernel encapsulated in pyCUDA, and by several abstractions created by pyCUDA's designer. There is a performance difference.
I've been using pyCUDA for a little while, and I like prototyping with it because it speeds up the process of turning an idea into working code.
With pyCUDA you will still be writing the CUDA kernels in C++, and it's still CUDA, so there shouldn't be a difference in the performance of running that code. But there will be a difference in the performance of the code you write in Python to set up or use the results of the pyCUDA kernel versus the code you write in C.
I was looking for an answer to the original question in this post, and I see the problem is deeper than I thought.
In my experience, I compared CUDA kernels and cuFFT calls written in C with the same ones written in PyCUDA. Surprisingly, I found that, on my computer, the performance of summing, multiplying, or taking FFTs varied between the implementations. For example, I got almost the same performance from cuFFT for vector sizes up to 2^23 elements. However, summing and multiplying complex vectors showed some problems. The speedup obtained in C/CUDA was ~6x for N = 2^17, whereas in PyCUDA it was only ~3x. It also depended on the way the summation was performed. By using SourceModule and wrapping the raw CUDA code, I ran into the problem that my kernel, for complex128 vectors, was limited to a smaller N (<= 2^16) than the one usable with gpuarray (<= 2^24).
In conclusion, it's worth testing and comparing both sides of the problem, and evaluating whether it is worthwhile to spend time writing a C/CUDA script, or to gain readability and pay the cost of lower performance.
Make sure you're using -O3 optimizations, and use nvprof/nvvp to profile your kernels if you're using PyCUDA and want high performance. If you want to use CUDA from Python, PyCUDA is probably THE choice, because interfacing C++/CUDA code via Python is just hell otherwise: you have to write a lot of ugly wrappers, and for numpy integration even more hardcore wrapper code would be necessary.

What's the significant difference between these two image processing libraries for C/C++?

OpenCV
GIL
Can some experienced user share their insight?
I need to know the differences to decide which one to choose.
A vote for OpenCV: if running on an Intel processor, it can use the Intel IPP optimized libraries.
The difference between those libs is huge.
The OpenCV library has a lot of image processing functions for solving the various tasks that are common in computer vision.
The GIL library has no such algorithms at all. It provides easy and strongly typed interfaces for working with different image formats (BGR, CMYK, etc.), and that's all. Of course, you can write your own image processing algorithms that work with GIL types. But if you want to work with feature descriptors, keypoints, image segmentation and so on, GIL is not a good choice. IMHO.

A C library for finding local maxima?

I'm trying to write an audio analysis application, and I need to identify local maxima in a 2D array which represents a spectrogram. I've already got an open source library that can generate the spectrogram using fast Fourier transforms, but I was wondering whether anybody knew of any good libraries to help me with actually finding the maxima? I'm not quite sure what to search Google for; the best I could think of was "numerical library", but that hasn't got me very far.
Preferably in C, but I'm open to other suggestions.
Peak finding is a fairly general problem. It has already been discussed once on SO as Peak detection of measured signal.
The answers provided include several viable heuristics.
Of course, I prefer my own answer there if you need rigor, but ROOT is written in C++ and is almost certainly too heavy for your application, so you'd need to strip out just the code you want...
The GNU Scientific Library features a multidimensional minimization framework that can be made to work for maximization easily enough. It's designed to only return a single minimum rather than a bunch of different minima, however.
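If a library feels like overkill, a hand-rolled scan is only a few lines of C. A minimal sketch (strict 8-neighbour maxima over the interior of the array; a magnitude threshold, as the heuristics in the linked question suggest, would be layered on top):

```c
#include <string.h>

/* Flag interior cells of an nrows x ncols spectrogram that are
   strictly greater than all 8 of their neighbours. */
void local_maxima(const float *s, int nrows, int ncols,
                  unsigned char *is_max /* same dimensions as s */)
{
    memset(is_max, 0, (size_t)nrows * ncols);  /* borders never qualify */
    for (int r = 1; r < nrows - 1; r++)
        for (int c = 1; c < ncols - 1; c++) {
            float v = s[r * ncols + c];
            int peak = 1;
            for (int dr = -1; dr <= 1 && peak; dr++)
                for (int dc = -1; dc <= 1 && peak; dc++)
                    if ((dr || dc) && s[(r + dr) * ncols + (c + dc)] >= v)
                        peak = 0;  /* a neighbour ties or exceeds it */
            is_max[r * ncols + c] = (unsigned char)peak;
        }
}
```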
