memcpy optimization in cortex-a8 arm - c

I use memcpy() in my implementation on an ARM Cortex-A8;
this is the first code I have developed for ARM processors.
I read at the following link that I can optimize performance through several strategies:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
My code contains:
..
memcpy(myvar1, myvar2, myvar_size * sizeof(var_complex));
..
How can I optimize this code for the ARM Cortex-A8 using the Eclipse GCC toolchain?
My code contains both C and assembly.
Does that affect the use of some registers?
I searched for examples but didn't find any.
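A minimal sketch of one commonly suggested strategy, copying through the NEON registers 16 bytes at a time. The helper name copy_bytes is hypothetical, and the NEON path is only compiled when the toolchain defines __ARM_NEON (with GCC for ARMv7 that usually means building with -mfpu=neon); everywhere else it simply falls back to memcpy(). Whether this beats the toolchain's own memcpy() on a given board is something to measure, not assume.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the question): copy n bytes from src to dst.
 * The buffers must not overlap, as with memcpy(). */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Move 16 bytes per iteration through one NEON q register. */
    while (n >= 16) {
        vst1q_u8(d, vld1q_u8(s));
        d += 16;
        s += 16;
        n -= 16;
    }
#endif
    /* Remaining tail (or the whole buffer on targets without NEON). */
    memcpy(d, s, n);
    return dst;
}
```

On the register question: written with intrinsics like this, register allocation is left to the compiler. Hand-written assembly that uses d8-d15 (q4-q7) has to save and restore them, because the ARM procedure call standard treats those registers as callee-saved.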

Related

_mm_mulhi_epi16 equivalent on ARM

I am working on a project to accelerate performance on the ARM platform with NEON intrinsics.
I could not find direct equivalents for the intrinsics below:
_mm_mulhi_epi16
_mm_hadd_epi32
_mm_maddubs_epi16
_mm_madd_epi16
_mm_extract_epi8
Equivalent intrinsics would help a lot in my efforts.
_mm_hadd_epi32 appears to match vpaddq_s32.
_mm_extract_epi8 appears to match vgetq_lane_s8.
Not sure about the others offhand.
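For _mm_mulhi_epi16 there does not seem to be a single matching NEON intrinsic, but the same result can be built from a widening multiply followed by a narrowing shift. The sketch below is one way to do it; the helper name mulhi_s16x8 is hypothetical, and a scalar fallback is included so the code also builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = high 16 bits of the 32-bit product
 * a[i] * b[i], which is what _mm_mulhi_epi16 computes per lane. */
void mulhi_s16x8(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t va = vld1q_s16(a);
    int16x8_t vb = vld1q_s16(b);
    /* Widening multiply of each half to 32 bits. */
    int32x4_t lo = vmull_s16(vget_low_s16(va), vget_low_s16(vb));
    int32x4_t hi = vmull_s16(vget_high_s16(va), vget_high_s16(vb));
    /* Shift right by 16 and narrow back to 16 bits. */
    vst1q_s16(out, vcombine_s16(vshrn_n_s32(lo, 16), vshrn_n_s32(hi, 16)));
#else
    for (int i = 0; i < 8; i++) {
        int32_t p = (int32_t)a[i] * (int32_t)b[i];
        out[i] = (int16_t)(p >> 16); /* assumes arithmetic right shift */
    }
#endif
}
```

Note that vpaddq_s32 is only available on AArch64; on 32-bit NEON the pairwise add has to be done on 64-bit halves with vpadd_s32.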

ARM ASM: Bad Instruction end

In my current project we are using Segger embOS as an RTOS.
The target system is an ARM Cortex-M MCU.
The RTOS has some code written in assembler.
However the ASM code produces an error:
RTOS.s:69: Error: bad instruction `end'
According to the ARM assembler reference guide
http://infocenter.arm.com/help/topic/com.arm.doc.dui0489f/DUI0489F_arm_assembler_reference.pdf
(Chapter 6.8.5) the instruction "END" exists (I'm not sure whether the assembler is case sensitive).
Although this instruction exists, the assembly won't compile.
Each of the included files terminates with
.end (note the "." and the lower-case letters)
File RTOS.s:
#define OS_RTOS_S_INCLUDED
/*******************************************************************
*
* Code section includes selected code
*
********************************************************************
*/
#if (defined __ARM_ARCH_6M__) || (defined __ARM_ARCH_8M_BASE__)
//
// Cortex-M0
//
#include "RTOS_CM0.S"
#elif (defined (__VFP_FP__) && defined (__SOFTFP__))
//
// Cortex-M3 or Cortex-M4 without VFP
//
#include "RTOS_CM3.S"
#elif (defined (__VFP_FP__) && !defined (__SOFTFP__))
//
// Cortex-M4 with VFP
//
#include "RTOS_CM4F.S"
#else
#error "No RTOS.S for selected CPU available, check configuration"
#endif
/********************************************************************/
END//Line 69
/***** End of file ************************************************/
Switching END to .end seems to resolve the compile error. However, the functions defined in the assembler file are not found by the linker (this could be a different problem, though).
So my question is: Why is the instruction END a bad instruction?
The END directive is an armasm directive, not an ARM assembly instruction. That is to say it is an instruction to the assembler during the build of the code, not an instruction to the processor. .end is the GNU as (GNU assembler) equivalent.
Different toolchains use different assembler directives and syntax. You are trying to build the armasm source code using gas (GNU assembler) which is not compatible. You will certainly encounter other issues than this that will prevent you building ARM toolchain specific source code/object with the GNU toolchain - not least, apart from the technical issues, there are legal issues given that embOS licences are toolchain specific.
Each Segger embOS license is provided for a specific toolchain. If you wish to use a different toolchain you will need a new license and different toolchain specific code/library - even if you have a source code licence; it is not just a legal issue but a technical one - Segger do not provide the code for all toolchains with a license for a single toolchain. If you only have an object code licence, it may not link if using a different toolchain (or in some cases even different toolchain version) than the object code was built with.
You need to check, but it is likely that you have a licence for the Keil ARM MDK toolchain (which includes armcc/armasm etc.). It is not a free tool; one way or another you need to purchase either an embOS licence for GNU or a licence for the toolchain your embOS licence covers.
In any case, you might do well to update your Segger support and maintenance licence so you can get technical support from them.

ARMv8A AArch64 vmlal_high_s8 Intrinsics

I'm looking for the intrinsic corresponding to the operation 'SMLAL2 Vd.8H,Vn.16B,Vm.16B', which according to ARM's own documentation (ARM Neon Intrinsics Ref) should be something like
int16x8_t vmlal_high_s8 (int16x8_t a,int8x16_t b,int8x16_t c)
However, the arm_neon.h included in ARM's GNU toolchain doesn't have anything corresponding to it. So my question is whether I just have to include something else, or whether I can otherwise circumvent this problem.
Thanks in advance!
For anyone else hitting this problem: I had chosen the ARM embedded toolchain instead of the Linaro one, which is the one suitable for AArch64.
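If the code also has to build with a 32-bit NEON toolchain, where vmlal_high_s8 is not available, the same operation can be expressed with vmlal_s8 on the upper halves. A minimal sketch, assuming a hypothetical wrapper named mlal_high_s8 and the usual compiler-defined macros __ARM_NEON and __aarch64__; the scalar branch is only there so it compiles on other targets too.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += b[8 + i] * c[8 + i] for i = 0..7,
 * i.e. the SMLAL2 Vd.8H, Vn.16B, Vm.16B operation. */
void mlal_high_s8(int16_t acc[8], const int8_t b[16], const int8_t c[16])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t va = vld1q_s16(acc);
    int8x16_t vb = vld1q_s8(b);
    int8x16_t vc = vld1q_s8(c);
#if defined(__aarch64__)
    va = vmlal_high_s8(va, vb, vc);
#else
    /* 32-bit NEON: take the upper halves explicitly. */
    va = vmlal_s8(va, vget_high_s8(vb), vget_high_s8(vc));
#endif
    vst1q_s16(acc, va);
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 8; i++) {
        acc[i] = (int16_t)(acc[i] + (int16_t)b[8 + i] * (int16_t)c[8 + i]);
    }
#endif
}
```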

How to check with Intel intrinsics if AVX extensions is supported by the CPU?

I'm writing a program using Intel intrinsics. I want to use the _mm_permute_pd intrinsic, which is only available on CPUs with AVX. For CPUs without AVX I can use _mm_shuffle_pd, but according to the specs it is much slower than _mm_permute_pd. Do the header files for Intel intrinsics define constants that allow me to distinguish whether AVX is supported, so that I can write something like this?
#ifdef __IS_AVX_SUPPORTED__ // is there something like this defined?
// use _mm_permute_pd
#else
// use _mm_shuffle_pd
#endif
I have found this tutorial, which shows how to perform a runtime check, but I need to do a static, compile-time check for the current machine.
GCC, ICC, MSVC, and Clang all define a macro __AVX__ which you can check. In fact, it's the only SIMD constant defined by all those compilers (MSVC is the one that breaks the mold). This only tells you whether your code was compiled with AVX support (e.g. -mavx with GCC or /arch:AVX with MSVC); it does not tell you whether your CPU supports AVX. If you want to know whether the CPU supports AVX, you need to check CPUID. Here, asm-in-c-error, is an example of reading CPUID with all those compilers.
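A minimal sketch of the compile-time check applied to the question's case. The helper name swap_pair is hypothetical; it swaps the two doubles of a 128-bit vector using _mm_permute_pd when the build has AVX enabled and _mm_shuffle_pd otherwise. The __SSE2__ test is a GCC/Clang convention (MSVC does not define it on x64), so on other compilers this sketch would drop to the scalar branch.

```c
#if defined(__SSE2__) || defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out = { in[1], in[0] }. */
void swap_pair(const double in[2], double out[2])
{
#if defined(__AVX__)
    /* Built with AVX enabled (e.g. -mavx or /arch:AVX). */
    __m128d v = _mm_loadu_pd(in);
    _mm_storeu_pd(out, _mm_permute_pd(v, 1));
#elif defined(__SSE2__)
    /* No AVX in this build: same lane swap with the SSE2 shuffle. */
    __m128d v = _mm_loadu_pd(in);
    _mm_storeu_pd(out, _mm_shuffle_pd(v, v, 1));
#else
    out[0] = in[1];
    out[1] = in[0];
#endif
}
```

Which branch is compiled depends only on the compiler flags, not on the machine the binary later runs on.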
To do this properly I suggest you make a CPU dispatcher.
Edit: In case anyone wants to know how to use the values from CPUID to find out if AVX is available see https://github.com/Mysticial/FeatureDetector
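As a rough illustration of the dispatcher idea, GCC and Clang provide the builtins __builtin_cpu_init and __builtin_cpu_supports on x86, which wrap the CPUID query. The function names cpu_has_avx and pick_kernel below are hypothetical, and on any other compiler or target this sketch simply reports that AVX is unavailable.

```c
/* Hypothetical helper: returns 1 if the CPU running this code reports AVX,
 * 0 if it does not or if the check is unavailable for this compiler/target. */
int cpu_has_avx(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical dispatcher: picks an implementation name at run time.
 * A real one would return a function pointer to the chosen kernel. */
const char *pick_kernel(void)
{
    return cpu_has_avx() ? "avx" : "baseline";
}
```

The AVX kernel itself would live in a separate source file compiled with -mavx, so that the rest of the program stays runnable on CPUs without AVX.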
I assume you are using Intel C++ Compiler. In this case - yes, there are such macros: Intel C++ Compiler Reference Guide: __AVX__, __AVX2__.
P.S. Be aware that if you compile your application with the AVX instruction set enabled, it will fail on CPUs that do not support AVX. If you are going to distribute your software as a source code package and compile it on the target machine, this may be a viable solution. Otherwise you should check for AVX dynamically.
P.P.S. There are several options for ICC. Take a look at the following compiler options and the further references they point to.
It seems to me that the only way is to compile and run a program that identifies whether AVX is available, and then manually or automatically compile separate code with or without AVX functions. For VS 2013, I would use my code in the commomAVX folder in the following to identify hasAVX (or not) and use this to execute one of two different BAT files to compile and link the appropriate program.
http://www.roylongbottom.org.uk/gigaflops-benchmarks.zip
My question was to help to identify a solution regarding the use of suitable compile options such as /arch:AVX.

Writing a piece of C code such that compiler uses SSE4.1 instruction for generating assembly Code

I want to write some C code that gcc can optimize when given the -msse4.1 flag. Basically, I want to check whether or not the compiler is taking advantage of SSE4.1 instructions.
There are many SSE4.1 instructions (http://en.wikipedia.org/wiki/SSE4#New_instructions), but I have not been able to write a fragment of C code that makes any of them appear in the generated assembly.
Thanks in advance.
From what I've seen, compilers rarely ever generate SSE4.1 instructions. I've seen a few cases where it will use the insert/extract instructions to pack data.
But for the most part, if you want to use the SSE4.1 instructions, you need to do them explicitly using intrinsics:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_bk_sse41.htm
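A minimal sketch of the explicit route, using the SSE4.1 intrinsic _mm_max_epi32 from smmintrin.h. The function name max4_i32 is hypothetical; the scalar branch is compiled when the build does not enable SSE4.1, so the same file still builds without the flag.

```c
#include <stdint.h>

#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Hypothetical helper: out[i] = max(a[i], b[i]) for four signed 32-bit ints. */
void max4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
#if defined(__SSE4_1__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* _mm_max_epi32 corresponds to the SSE4.1 instruction PMAXSD. */
    _mm_storeu_si128((__m128i *)out, _mm_max_epi32(va, vb));
#else
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] > b[i] ? a[i] : b[i];
    }
#endif
}
```

Compiling with something like gcc -O2 -msse4.1 -S and searching the generated .s file for pmaxsd should show whether the instruction was emitted.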
I doubt GCC would emit SSE4.1 instructions that easily. But you could have a look at Intel SPMD Program Compiler:
Under the SPMD model, the programmer writes a program that mostly
appears to be a regular serial program, though the execution model is
actually that a number of program instances execute in parallel on the
hardware. (See a more detailed example that illustrates this concept.)
ispc compiles a C-based SPMD programming language to run on the SIMD
units of CPUs; it frequently provides a 3x or more speedup on CPUs
with 4-wide SSE units, without any of the difficulty of writing
intrinsics code.
