I have programmed a routine to process single-precision float arrays using NEON on the Android platform, specifically the Samsung S4, and find that my NEON routines are limited by access to the array data. For interest's sake, a snippet is below:
NEON:
m1 = vmulq_f32(*(float32x4_t *)&ey[i][j],*(float32x4_t *)&caey[i][j]);
m2 = vsubq_f32(*(float32x4_t *)&hz[i-1][j],*(float32x4_t *)&hz[i][j]);
m3 = vmulq_f32(*(float32x4_t *)&cbey[i][j],m2);
m4 = vaddq_f32(m1,m3);
vst1q_f32(&ey[i][j],m4);  /* store back using the same 2D indexing as the loads */
Serial:
ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * ( hz[i-1][j] - hz[i][j] );
Built on the Android phone using C4droid gcc and also AIDE-JNI. The NEON intrinsics code above takes slightly longer to run than the serial equivalent. When I replace the array data with dummy const floats, the code runs nearly four times as fast as the serial version with array data, although it of course produces nonsense results (this does confirm that the performance problem lies with the data access). My equivalent SSE and AVX code on other platforms produces good speedups.
I have tried 1D equivalent arrays and prefetching data with __builtin_prefetch, but cannot speed up the data access for the NEON intrinsics.
Is there anything else I can try to improve the data access performance on the Android phone?
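For concreteness, a prefetch attempt of this kind would look roughly like the sketch below. This is a hypothetical reconstruction, not the exact code tried: PF_DIST is an assumed prefetch distance that would need tuning, and je (the row length) is assumed to be a multiple of 4.

#include <arm_neon.h>

/* Sketch of the inner loop with prefetch hints. The last few prefetches run
   past the end of the row; they are only hints, but worth noting. */
#define PF_DIST 64   /* assumed distance, in floats */

for (int j = 0; j < je; j += 4) {
    __builtin_prefetch(&ey[i][j + PF_DIST]);
    __builtin_prefetch(&caey[i][j + PF_DIST]);
    __builtin_prefetch(&cbey[i][j + PF_DIST]);
    __builtin_prefetch(&hz[i][j + PF_DIST]);

    float32x4_t m1 = vmulq_f32(vld1q_f32(&ey[i][j]),   vld1q_f32(&caey[i][j]));
    float32x4_t m2 = vsubq_f32(vld1q_f32(&hz[i-1][j]), vld1q_f32(&hz[i][j]));
    float32x4_t m3 = vmulq_f32(vld1q_f32(&cbey[i][j]), m2);
    vst1q_f32(&ey[i][j], vaddq_f32(m1, m3));
}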
Related
I'm trying to do an FFT -> signal manipulation -> inverse FFT pipeline using Project NE10 in my C++ project, and to convert the complex output to amplitudes and phases for the FFT and vice versa for the IFFT. But the performance of my C++ code is not as good as the SIMD-enabled NE10 code, as per the benchmarks. Since I have no experience with ARM assembly, I'm looking for some help writing NEON code for the unoptimised C module. For example, before the IFFT I do this:
for (int bin = 0; bin < NUM_FREQUENCY_BINS; bin++) {
    input[bin].real = amplitudes[bin] * cosf(phases[bin]);
    input[bin].imag = amplitudes[bin] * sinf(phases[bin]);
}
where input is an array of C structs (for complex values), amplitudes & phases are float arrays.
The above block (O(n) complexity) takes about 0.6 ms for 8192 bins, while the NE10 FFT (O(n·log n) complexity) takes only 0.1 ms because of its SIMD operations. From what I've read so far on Stack Overflow and elsewhere, intrinsics are not worth the effort, so I'm trying hand-written ARM NEON assembly instead.
You can use NEON for trig functions if you settle for approximations. I am not affiliated, but there is an implementation here that uses intrinsics to create vectorised sin/cos functions, accurate to many decimal places, that perform substantially better than simply calling sinf, etc. (benchmarks are provided by the author).
The code is especially well suited to your polar-to-cartesian calculation, as it generates sin and cos results simultaneously. It might not be suitable for something where absolute precision is crucial, but for anything to do with frequency-domain audio processing, that is normally not the case.
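As a sketch of how such a routine would slot into your loop, here is one possible shape, assuming a hypothetical vectorised helper sincos4_f32() that returns four sines and four cosines at once (the real library's function name and signature will differ), and num_bins a multiple of 4:

#include <arm_neon.h>

typedef struct { float real; float imag; } Complex;  /* layout as described in the question */

/* Hypothetical vectorised helper: sin and cos of four angles at once. */
void sincos4_f32(float32x4_t x, float32x4_t *s, float32x4_t *c);

void polar_to_cartesian(Complex *input, const float *amplitudes,
                        const float *phases, int num_bins)
{
    for (int bin = 0; bin < num_bins; bin += 4) {
        float32x4_t amp = vld1q_f32(&amplitudes[bin]);
        float32x4_t s, c;
        sincos4_f32(vld1q_f32(&phases[bin]), &s, &c);

        float32x4x2_t out;
        out.val[0] = vmulq_f32(amp, c);   /* real parts */
        out.val[1] = vmulq_f32(amp, s);   /* imaginary parts */

        /* vst2q_f32 interleaves the two vectors: real, imag, real, imag, ... */
        vst2q_f32((float *)&input[bin], out);
    }
}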
As far as I know, NEON doesn't support vector operations for trigonometric functions (sin, cos). But you can of course still improve your code. One option is to use a table of pre-calculated sine and cosine values; this can lead to a significant performance improvement.
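A minimal sketch of the lookup-table idea (the table size of 4096 is an arbitrary choice trading memory for accuracy, and phases are assumed to lie in [0, 2*pi)):

#include <math.h>

#define TABLE_SIZE 4096                   /* must be a power of two for the mask below */
#define TWO_PI 6.28318530718f

static float sin_table[TABLE_SIZE];
static float cos_table[TABLE_SIZE];

static void init_trig_tables(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        float angle = TWO_PI * i / TABLE_SIZE;
        sin_table[i] = sinf(angle);
        cos_table[i] = cosf(angle);
    }
}

/* Nearest-entry lookup; quantisation error is about TWO_PI / TABLE_SIZE. */
static inline int phase_to_index(float phase)
{
    return (int)(phase * (TABLE_SIZE / TWO_PI)) & (TABLE_SIZE - 1);
}

/* In the loop from the question:
   int idx = phase_to_index(phases[bin]);
   input[bin].real = amplitudes[bin] * cos_table[idx];
   input[bin].imag = amplitudes[bin] * sin_table[idx];   */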
Regarding intrinsics versus assembly for NEON: I have tried both, and in most cases they give practically the same result (with a modern compiler), but writing assembler is more labour-intensive. The main performance improvement comes from handling the data correctly (loading, storing) and from using vector instructions, and all of that can be done with intrinsics.
Of course, if you want to achieve 100% utilisation of the CPU, you sometimes need assembler, but that is a rare case.
I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:
... why don't you use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.
The trouble I am having is that the linked solution offers code that works on non-interleaved data and performs fused multiplies on floats. I'm trying to separate the two and understand just the interleaved loads.
According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.
Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsics). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.
Given the following SSE intrinsics that operate on data in BGR BGR BGR BGR... format and need a shuffle into BBBB GGGG RRRR...:
const byte* data = ... // assume 16-byte aligned
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i*)(data)),mask);
How do we perform the interleaved loads using NEON intrinsics so that we don't need the SSE shuffles?
Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux-powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code three times (Microsoft 32-bit ASM, Microsoft 64-bit ASM, and GCC ASM).
According to this page:
The VLD3 intrinsic you need is:
int8x8x3_t vld3_s8(__transfersize(24) int8_t const * ptr);
// VLD3.8 {d0, d1, d2}, [r0]
If the address pointed to by ptr holds this data:
0x00: 33221100
0x04: 77665544
0x08: bbaa9988
0x0c: ffddccbb
0x10: 76543210
0x14: fedcba98
You will finally get in the registers:
d0: ba54ffbb99663300
d1: dc7610ccaa774411
d2: fe9832ddbb885522
The int8x8x3_t structure is defined as:
struct int8x8x3_t
{
int8x8_t val[3];
};
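Applied to the BGR case from the question, a sketch using the quad-register variant: vld3q_u8 loads 48 bytes (16 BGR pixels) and hands back the B, G and R bytes already deinterleaved. The plane pointers and the assumption that count is a multiple of 16 are illustrative.

#include <stddef.h>
#include <stdint.h>
#include <arm_neon.h>

/* Deinterleave BGRBGR... into separate B, G, R planes, 16 pixels at a time. */
void deinterleave_bgr(const uint8_t *data, uint8_t *b, uint8_t *g, uint8_t *r,
                      size_t count)
{
    for (size_t i = 0; i < count; i += 16) {
        uint8x16x3_t bgr = vld3q_u8(data + 3 * i);  /* 3-way deinterleaving load (VLD3.8) */
        vst1q_u8(b + i, bgr.val[0]);   /* all the B bytes */
        vst1q_u8(g + i, bgr.val[1]);   /* all the G bytes */
        vst1q_u8(r + i, bgr.val[2]);   /* all the R bytes */
    }
}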
THIS QUESTION IS ABOUT C++ CODE TARGETED FOR AVX/AVX2 INSTRUCTIONS, as shipped in Intel processors since 2013 (and/or AVX-512 since 2015).
How do I generate one million random Gaussian unit normals fast on Intel processors with new instructions sets?
More generic versions of this question have been asked a few times before, e.g., Generate random numbers following a normal distribution in C/C++. Yes, I know about Box-Muller, summing uniforms, and other techniques. I am tempted to build my own inverse normal distribution, sample (i.e., map) exactly according to expectations (pseudo-normals, then), and then randomly shuffle the order.
But I also know I am using an Intel Core processor with recent AVX vector and AES instruction sets. Besides, I need C (not C++ with its std library), and it needs to work on Linux and OS X with gcc.
So, is there a better processor-specific way to generate so many random numbers fast? For such large quantities of random numbers, does Intel processor hardware even offer useful instructions? Are they an option worth looking into, and if so, is there an existing standard function implementation of "rnorm"?
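For concreteness, the scalar Box-Muller baseline that any processor-specific approach would need to beat looks roughly like this sketch; rand() is only a placeholder uniform source, and n is assumed to be even:

#include <math.h>
#include <stdlib.h>

/* Scalar Box-Muller baseline: fills out[0..n-1] with standard normals,
   two per pair of uniform draws. */
static void fill_gaussians(float *out, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        float u1 = (rand() + 1.0f) / ((float)RAND_MAX + 2.0f);  /* in (0,1], avoids log(0) */
        float u2 = (float)rand() / (float)RAND_MAX;
        float r = sqrtf(-2.0f * logf(u1));
        out[i]     = r * cosf(6.28318530718f * u2);
        out[i + 1] = r * sinf(6.28318530718f * u2);
    }
}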
The Intel Xeon Phi provides the "IMCI" instruction set.
I used it to compute "c = a*b" elementwise, like this:
float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float z[N];
_Cilk_for(size_t i = 0; i < N; i+=16)
{
__m512 x_1Vec = _mm512_load_ps(x+i);
__m512 y_1Vec = _mm512_load_ps(y+i);
__m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
_mm512_store_pd(z+i,ans);
}
And when I test its performance with N = 1048576,
it costs 0.083317 sec. I want to compare the performance with auto-vectorization,
so the other version of the code looks like this:
_Cilk_for(size_t i = 0; i < N; i++)
z[i] = x[i] * y[i];
This version costs 0.025475 sec (but sometimes 0.002285 or less; I don't know why).
If I change the _Cilk_for to #pragma omp parallel for, the performance is poor.
So, if that is the result, why do we need to use intrinsics?
Did I make any mistakes anywhere?
Can someone give me some good suggestions to optimize the code?
The measurements don't mean much, because of various mistakes.
The code is storing 16 floats as 8 doubles. The _mm512_store_pd should be _mm512_store_ps.
The code is using _mm512_store_... on an unaligned location with address z+i, which may cause a segmentation fault. Use __declspec(align(64)) to fix this.
The arrays x and y are not initialized. That risks the arrays containing denormal values, which might impact performance. (I'm not sure whether this is an issue on Intel Xeon Phi.)
There's no evidence that z is used, hence the optimizer might remove the calculation. I think it is not the case here, but it's a risk with trivial benchmarks like this.
Also, allocating a large array on the stack risks stack overflow.
A single run of the examples is probably a poor benchmark, because the time is probably dominated by fork/join overheads of the _Cilk_for. Assuming 120 Cilk workers (the default for 60 4-way threaded cores), there is only about 1048576/120/16 = ~546 iterations per worker. With a clock rate over 1 GHz, that won't take long. In fact, the work in the loop is so small that most likely some workers never get a chance to steal work. That might account for why the _Cilk_for outruns OpenMP. In OpenMP, all the threads must take part in a fork/join for a parallel region to finish.
If the test were written to correct all the mistakes, it would essentially be computing z[:] = x[:]*y[:] on a large array. Because of the wide vector units on Intel(R) Xeon Phi(TM), this becomes a test of memory/cache bandwidth, not ALU speed, since the ALU is quite capable of outrunning memory bandwidth.
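Putting those corrections together, a fixed version of the kernel might look like the sketch below (keeping the original _Cilk_for structure; the initial values are arbitrary, and the volatile read is only there to keep the optimizer from discarding the work):

#include <stdlib.h>
#include <immintrin.h>   /* _mm_malloc and the _mm512_* intrinsics with the Intel compiler */

#define N 1048576
#define ALIGNMENT 64

void multiply_fixed(void)
{
    /* all three arrays heap-allocated, 64-byte aligned, and initialized */
    float *x = (float *)_mm_malloc(N * sizeof(float), ALIGNMENT);
    float *y = (float *)_mm_malloc(N * sizeof(float), ALIGNMENT);
    float *z = (float *)_mm_malloc(N * sizeof(float), ALIGNMENT);
    for (size_t i = 0; i < N; i++) { x[i] = (float)i; y[i] = 2.0f; }

    _Cilk_for(size_t i = 0; i < N; i += 16)
    {
        __m512 x_1Vec = _mm512_load_ps(x + i);
        __m512 y_1Vec = _mm512_load_ps(y + i);
        __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
        _mm512_store_ps(z + i, ans);   /* _ps, not _pd, and z is now aligned */
    }

    volatile float sink = z[N - 1];    /* use z so the loop cannot be removed */
    (void)sink;

    _mm_free(x); _mm_free(y); _mm_free(z);
}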
Intrinsics are useful for things that can't be expressed as parallel/simd loops, typically stuff needing fancy permutations. For example, I've used intrinsics to do a 16-element prefix-sum operation on MIC (only 6 instructions if I remember correctly).
My answer below equally applies to Intel Xeon and Intel Xeon Phi.
An intrinsics-based solution is the most "powerful", just as assembly coding is.
On the negative side, an intrinsics-based solution is usually not portable, is not a "productivity"-oriented approach,
and is often not applicable to established "legacy" software codebases.
It also often requires the programmer to be a low-level or even micro-architecture expert.
However, there are alternatives to intrinsics/assembly coding:
A) Auto-vectorization (the compiler recognizes certain patterns and automatically generates vector code).
B) "Explicit" or user-guided vectorization (the programmer gives the compiler guidance on what to vectorize and under which conditions; explicit vectorization usually means using keywords or pragmas).
C) Using vector classes or other intrinsics-wrapper libraries, or even very specialized compilers. In fact, (C) is often as bad as intrinsics coding in terms of productivity and incremental updates to legacy code.
In your second code snippet you seem to use "explicit" vectorization, which is currently achievable with the Cilk Plus and OpenMP 4.0 "frameworks" supported by all recent versions of the Intel compiler and also by GCC 4.9.
(I say that you seem to use explicit vectorization because _Cilk_for was originally invented for multi-threading; however, the most recent versions of the Intel compiler may automatically parallelize and vectorize the loop when _Cilk_for is used.)
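For illustration, the same elementwise multiply written with explicit vectorization instead of intrinsics (x, y, z and N as in the question; a sketch, and the exact pragma/flag support depends on the compiler version):

/* OpenMP 4.0: thread the loop and request SIMD code generation explicitly
   (icc, or gcc >= 4.9, compiled with -fopenmp). */
#pragma omp parallel for simd
for (size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

/* Cilk Plus array notation (Intel compiler): the whole operation in one statement. */
z[0:N] = x[0:N] * y[0:N];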
We are planning to port a big part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture such as a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-to-CPU bandwidth [GB/sec] and in a high number of 32-bit IEEE-754 floating-point multiply-accumulate operations per second.
I have selected a typical representative of modern desktop CPUs: quad core, about 10 MB cache, 3 GHz, 45 nm. Can you please help me find out its limits:
1) The highest possible number of multiply-accumulate operations per second, assuming GCC is allowed to use CPU-specific instructions via input flags and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture such as AltiVec on PowerPC; the best option is to use GCC flags like -msse or -maltivec. I assume the program also has to have 4 threads in order to utilize all available cores, right?
2) The SDRAM-to-CPU bandwidth (the upper limit, i.e. independent of the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate scalar SSE/SSE2 code when the target supports those instructions. Automatic vectorization for SSE/SSE2 was added in GCC 4. SSE4.1 introduces the DPPS and DPPD instructions (dot products for array-of-structs data). The new 45 nm Intel processors support SSE4 instructions.
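For reference, a minimal sketch of what the SSE4.1 dot-product instruction looks like through intrinsics (the function and its use of unaligned loads are illustrative; 0xF1 means "multiply all four lanes, sum them, and write the result to the lowest lane"):

#include <smmintrin.h>   /* SSE4.1; compile with -msse4.1 */

/* 4-element dot product via DPPS. */
float dot4(const float *a, const float *b)
{
    __m128 d = _mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1);
    return _mm_cvtss_f32(d);
}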
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
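As an illustration of the "simple enough for the compiler" case, a multiply-accumulate loop of the following shape is the kind of code GCC's auto-vectorizer (e.g. -O3 -msse3, or -maltivec on PowerPC) can usually turn into packed multiplies and adds without source changes. This is a sketch; real DSP kernels are rarely this clean.

/* Portable multiply-accumulate kernel: y[i] += a[i] * x[i]. The restrict
   qualifiers tell the compiler the arrays do not overlap, which helps
   auto-vectorization. */
void madd(float *restrict y, const float *restrict a,
          const float *restrict x, int n)
{
    for (int i = 0; i < n; i++)
        y[i] += a[i] * x[i];
}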
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
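To make the "stay within vector lanes" advice concrete, here is a sketch of a dot product that uses only MULPS/ADDPS and defers the horizontal reduction to the very end (n assumed to be a multiple of 4 and the pointers 16-byte aligned):

#include <stddef.h>
#include <xmmintrin.h>

/* Dot product kept "vertical": multiply and accumulate whole vectors, then do
   a single horizontal reduction at the end, instead of one DPPS per 4 floats. */
float dot_vertical(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);       /* one horizontal step at the end */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}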
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (though when used on AMD CPUs you have to patch away some CPU checks Intel puts in to prevent AMD from looking good).
3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, and on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.