Does C internally use Karatsuba algorithm to multiply two integers? - c

When C multiplies two n-bit integers, does it internally use the normal O(n^2) multiplication algorithm, or does it use a variation of Karatsuba's O(n^log_2(3)) multiplication algorithm?

No. The 'book-keeping' overhead of the Karatsuba algorithm is too high and too complex. It would take up far more silicon than a multiplier, even if it was to achieve a break-even recursive depth on a machine word level. A hardware crypto accelerator or FPGA might make it worthwhile for large enough n. Even then, the break-even might be too high to be useful for crypto needs. There's no free lunch.
On the other end of the spectrum, we can look at the gmp-mparam.h files in the GMP library, which define threshold values at which asymptotically faster algorithms actually begin to pay off. 'Karatsuba' is the 2x2 case of the more general Toom-Cook algorithm. Even on monsters like Broadwell and Skylake CPUs, the threshold is around 28 'words', or 1792 bits. That's due to the overhead in (recursively) adding 3 results back together, with carry propagation. These thresholds will keep getting higher as multiply instruction throughput increases.
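To make the book-keeping overhead concrete, here is a minimal sketch of a 64x64 -> 128-bit multiply built from 32-bit halves, once with the schoolbook method (four multiplies) and once with a single Karatsuba level (three multiplies). The type `u128` and both function names are hypothetical, chosen for illustration; this is not how GMP or any compiler implements multiplication.

```c
#include <stdint.h>

/* 128-bit result as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t hi, lo; } u128;

/* Schoolbook: four 32x32 -> 64 multiplies, then column-wise addition. */
static u128 mul_schoolbook(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Middle column: three values below 2^32, so no 64-bit overflow. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    u128 r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

/* One Karatsuba level: three multiplies, but extra subtractions,
 * sign tracking and a 65-bit middle term. */
static u128 mul_karatsuba(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t z0 = a0 * b0;
    uint64_t z2 = a1 * b1;
    uint64_t da = a1 > a0 ? a1 - a0 : a0 - a1;   /* |a1 - a0| */
    uint64_t db = b1 > b0 ? b1 - b0 : b0 - b1;   /* |b1 - b0| */
    int negative = (a1 < a0) != (b1 < b0);       /* sign of (a1-a0)*(b1-b0) */
    uint64_t d = da * db;                        /* the third multiply */
    /* z1 = z0 + z2 - (a1-a0)*(b1-b0) = a0*b1 + a1*b0, held as (c, z1). */
    uint64_t z1 = z0 + z2;
    uint64_t c = z1 < z0;                        /* carry out of bit 63 */
    if (negative) { z1 += d; c += z1 < d; }
    else          { c -= z1 < d; z1 -= d; }
    u128 r;
    r.lo = z0 + (z1 << 32);
    r.hi = z2 + (z1 >> 32) + (c << 32) + (r.lo < z0);
    return r;
}
```

Even at this tiny size, the Karatsuba version trades one multiply for two absolute differences, a sign, and two extra carries, which is the kind of overhead the thresholds in gmp-mparam.h account for.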

C doesn't multiply. It defines the semantics of the binary * operator. How it's implemented is left to the compiler.
The compiler doesn't multiply. It converts the binary * operator into (a) machine instruction(s), which are executed by the CPU.
The CPU multiplies. It generally provides specialized instructions for that. What instructions it offers to the compiler, and how it implements these instructions, depends on the CPU and its intended use. But multiplication is such a common task that any desktop CPU worth its salt has dedicated hardware circuitry for it.

Related

How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson?

I'm just curious about how the standard sqrt() from math.h works on GCC. I coded my own sqrt() using Newton-Raphson!
Yeah, I know about fsqrt. But how does the CPU do it? I can't debug hardware.
Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.
Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.
Maybe also see
The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers were replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm)
https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations
But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)
See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication
For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.
Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).
sqrt is defined by C, so most likely you have to look in glibc.
You did not specify which architecture you are asking for, so I think it's safe to assume x86-64. If that's the case, they are defined in:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrt.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtf.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtl.c
tl;dr they are simply implemented by calling the x86-64 square root instructions sqrtss / sqrtsd:
https://www.felixcloutier.com/x86/sqrtss
https://www.felixcloutier.com/x86/sqrtsd
Furthermore, and just for the sake of discussion, if you enable fast-math (something you probably should not do if you care about result precision), you will see that most compilers will actually inline the call and directly emit the sqrts{sd} instructions:
https://godbolt.org/z/Wb4unC
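For comparison with the hand-written Newton-Raphson approach mentioned in the question, a minimal software sketch could look like the following. The name `newton_sqrt` is hypothetical, and this is neither what glibc nor what the hardware does; unlike the sqrt instruction, it is not guaranteed to be correctly rounded.

```c
/* Newton-Raphson square root: iterate x <- (x + a/x) / 2.
 * Sketch only: returns 0 for zero, negative and NaN inputs. */
static double newton_sqrt(double a)
{
    if (!(a > 0.0))
        return 0.0;
    /* Start at or above sqrt(a) so the iterates decrease monotonically. */
    double x = a >= 1.0 ? a : 1.0;
    for (;;) {
        double next = 0.5 * (x + a / x);
        if (!(next < x))   /* stopped decreasing: converged */
            break;
        x = next;
    }
    return x;
}
```

Comparing its output against sqrt() from math.h for a range of inputs is a simple way to see where the two differ in the last bit.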

Determine FLOPS of our ASM program

We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed row format (CSR). Now that we have implemented all these algorithms, we want to know how much better they perform compared with the usual matrix multiplication. We have already implemented code to measure the running time of all these algorithms, but now we have decided that we also want to know how many floating-point operations per second (FLOPS) we can perform.
Any suggestion of how to measure/count this?
Here is some background information on the system used:
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc08
CPU revision : 2
Our first idea was now to implement a kind of FPO counter which we increment after each floating point operation (Arithmetic operations as well as comparison and move operations), but that would mean that we have to insert increment operations all over our code which also slows down the application ...
Does anyone know if there is some kind of hardware counter which counts the number of floating-point operations, or if there exists some kind of performance tool which can be used to monitor our program and measure the number of FPOs?
Any suggestions or pointers would be appreciated.
Here is the evaluation of the FLOPS for a matrix multiplication using the counting approach. We first measured the running time, then inserted counters for each instruction we were interested in, and after that we calculated the number of floating-point operations per second.
It looks like the closest you can get with the performance events supported by Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, then the CPU needs to be psychic to know how many of the FP operations it's executing actually count.
Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
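As a minimal sketch of plumbing that formula into a benchmark, assuming the running time has already been measured elsewhere (both function names are hypothetical):

```c
#include <assert.h>

/* Operation count for the naive (m x n) * (n x m) product:
 * m*m output elements, each needing n multiplies and n-1 adds. */
static double naive_matmul_op_count(unsigned m, unsigned n)
{
    assert(n > 0);
    return (double)m * m * (2.0 * n - 1.0);
}

/* Operations per second from an analytic count and a measured time. */
static double flops_per_second(double op_count, double seconds)
{
    assert(seconds > 0.0);
    return op_count / seconds;
}
```

For example, `flops_per_second(naive_matmul_op_count(m, n), elapsed)` gives the figure for one timed run; the sparse formats would each need their own count, which depends on the number of non-zero entries.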
Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.
Think of it this way: say on some given 100MHz CPU a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS, the latter only 50 MFLOPS. Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?
* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.
Number of cores x average frequency x operations per cycle

Should cublas be outperformed by atlas?

According to my measurements of dgemm from both cublas and atlas, atlas severely beats cublas in terms of speed. Is this to be expected for a system with an Intel i7 950 and Nvidia GTX470?
I tested matrices of size 10x10 up to 6000x6000 in increments of 50. Atlas always wins. I measure both total application execution and just the multiplication step.
Does anyone else have experience with this? Are these the expected results?
Thanks in advance.
edit: (same code, same results on a Xeon X5670 and Nvidia Tesla C2050)
edit2: It appears a great deal of the slowness is attributable to initialisation of the cublas library. I continue to work on it. I'll update here when I learn more.
Did you use the single-threaded versions of both libraries? As far as I understand, both GotoBLAS and Atlas tend to sneakily use multiple threads when working on large matrices.
That said, at large matrix sizes the algorithm used tends to matter much more than the low-level implementation. Naive matrix multiplication is O(N^3), whereas the Strassen algorithm scales much better, about O(N^2.81) or so. The Strassen algorithm also happens to vectorize very nicely (to much larger SSE and AVX registers, yielding an almost 2- to 8-fold increase in efficiency, depending on floating-point format and register size).
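To put numbers on the N^3 versus N^2.81 difference, here is a small sketch that counts scalar multiplications for N = 2^k, assuming Strassen is recursed all the way down to 1x1 blocks (real implementations switch to a naive kernel below some cutoff, so this is an idealised count). The function names are hypothetical.

```c
#include <assert.h>

/* Scalar multiplications in the naive N x N product, N = 2^k. */
static unsigned long long naive_mults(unsigned k)
{
    assert(k <= 21);                 /* keep N^3 within 64 bits */
    unsigned long long n = 1ULL << k;
    return n * n * n;
}

/* Strassen replaces 8 half-size block products with 7, so a full
 * recursion performs 7^k scalar multiplications. */
static unsigned long long strassen_mults(unsigned k)
{
    assert(k <= 21);
    unsigned long long r = 1;
    for (unsigned i = 0; i < k; ++i)
        r *= 7;
    return r;
}
```

For N = 1024 (k = 10) this gives 1073741824 multiplications for the naive method against 282475249 for Strassen, roughly a 3.8x reduction, before counting Strassen's additional matrix additions.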
I am not sure how well the two GPUs you mentioned handle double-precision math. Typically they're optimized for single precision (32-bit floats), dropping to a third or a quarter of that speed when handling doubles.
There are other factors in your tests that may skew the results. For example, you may be including the matrix transfer time to the CPU. Whether that matches real-world use cases, I don't know (I don't have an Nvidia GPU to test), but I suspect not. Usually there are multiple operations, and the matrix does not need to be transferred between operations.
I've been writing my own low-level SSE3 matrix functions using SSE/AVX vector built-in functions provided by GCC and ICC C99 compilers; early testing indicates it beats the current Fortran implementations by a wide margin, especially at the very small (say up to 8x8, optimized for each size) and very large (above 1000x1000, using Strassen algorithm) sizes for dense matrices.

C - the limits of speed of desktop CPUs if the program is built using GCC with all optimization flags?

We are planning to port a big part of our Digital Signal Processing routines from hardware-specific chips to a common desktop CPU architecture like Quad-Core or so. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating-point Multiply-Accumulate operations per second.
I have selected a typical representative of the modern desktop CPUs - Quad Core, about 10Mb cache, 3GHz, 45nm. Can you please help me to find out its limits:
1) Highest possible Multiply-Accumulate operations per second if the CPU-specific instructions which GCC supports via input flags are used and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture like Altivec on PowerPC - the best option is to use GCC flags like -msse or -maltivec. I also assume the program has to have 4 threads in order to utilize all available cores, right?
2) SDRAM-CPU bandwidth (highest limit, so indep. on the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been added since GCC 4. SSE4.1 introduces DPPS, DPPD instructions - Dot product for Array of Structs data. New 45nm Intel processors support SSE4 instructions.
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
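The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the formula as stated in this answer; the function name and parameter names are hypothetical, and the inputs are the figures assumed above (4 cores, 3.2 GHz, one vector add plus one vector multiply per cycle, 4 floats per vector).

```c
/* Peak throughput in Gflops:
 * cores * GHz * vector ops per cycle * floats per vector. */
static double peak_gflops(int cores, double ghz,
                          double vector_ops_per_cycle, int lanes)
{
    return cores * ghz * vector_ops_per_cycle * lanes;
}

/* Scalar code uses one lane per operation instead of all of them. */
static double scalar_gflops(int cores, double ghz,
                            double vector_ops_per_cycle)
{
    return peak_gflops(cores, ghz, vector_ops_per_cycle, 1);
}
```

With the numbers above, `peak_gflops(4, 3.2, 2.0, 4)` evaluates to 102.4, the scalar figure to 25.6, and the conservative 50% planning figure is 51.2 Gflops.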
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
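As an illustration of keeping operations within vector lanes, here is a sketch in plain C of a structure-of-arrays layout for batches of 4-component dot products. The type and function names are hypothetical. Because every output element uses only element i of each input array, a vectorizing compiler may turn the loop into vertical multiplies and adds with no horizontal step, though whether it does depends on the compiler and flags.

```c
#include <stddef.h>

/* Structure of arrays: each component of all vectors is contiguous. */
typedef struct {
    const float *x, *y, *z, *w;
} vec4_soa;

/* out[i] = dot(a_i, b_i) for n vectors; each iteration is independent
 * and touches only index i, so lanes never need to be combined. */
static void dot4_soa(vec4_soa a, vec4_soa b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i]
               + a.z[i] * b.z[i] + a.w[i] * b.w[i];
    }
}
```

With an array-of-structs layout (x, y, z, w interleaved per vector), the same computation needs a horizontal sum per vector, which is the case DPPS targets.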
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when used on AMD's CPUs you have to patch away some CPU checks Intel puts in to prevent AMD from looking good).
3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, and on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
So...
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What #cube said, and the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and include other complications in digital design. Digital logic design includes something that you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but it also would be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). Plus it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into the memory circuit.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
It's because this speedup would be only in O(n), but the complexity of factoring the number is something like O(2^n) (with respect to the number of bits). So if you made this überprocessor and factorized the numbers 1000 times faster, I would only have to make the numbers 10 bits larger and we would be back at the start again.
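The "10 bits" figure follows from 2^10 = 1024 >= 1000. A small sketch of that arithmetic, under this answer's simplified O(2^n) cost model (the function name is hypothetical):

```c
/* Smallest number of extra bits k such that 2^k >= speedup, i.e. how
 * much longer a key must get to cancel a constant-factor hardware
 * speedup when the work doubles with every added bit. */
static unsigned extra_bits_to_cancel(unsigned long long speedup)
{
    unsigned bits = 0;
    unsigned long long work = 1;
    while (work < speedup && bits < 63) {
        work *= 2;
        ++bits;
    }
    return bits;
}
```

A 1000x faster machine is cancelled by 10 extra bits and a 1000000x faster one by 20, which is why constant-factor hardware gains do little against exponential-style growth.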
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware implementation of the sieving step, which reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device, called TWIRL, can be seen as an extension of the TWINKLE device. However, unlike TWINKLE it does not have optoelectronic components, and can thus be manufactured using standard VLSI technology on silicon wafers. The underlying idea is to use a single copy of the input to solve many subproblems in parallel. Since input storage dominates cost, if the parallelization overhead is kept low then the resulting speedup is obtained essentially for free. Indeed, the main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input. Addressing this involves myriad considerations, ranging from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia
