The fastest way to copy a 32bits array into 16bits arrays? - c

What is the best way to copy a 32bits array into 16bits arrays?
I know that "memcpy" uses hardware instruction.But is there a standard function to copy arrays with "changing size" in each element?
I use gcc for armv7 (cortex A8).
uint32_t tab32[500];
uint16_t tab16[500];
for(int i=0;i<500;i++)
tab16[i]=tab32[i];

On ARM cortex A8 with Neon instruction set, the fastest methods use interleaved read/write instructions:
vld2.16 {d0,d1}, [r0]!
vst1.16 {d0}, [r1]!
or saturating instructions to convert a vector of 32-bit integers to a vector of 16-bit integers.
Both of these methods are available in c using gcc intrinsic. It's also possible that gcc can autovectorize a carefully written c-code to use nothing but these particular instructions. This would basically require that there's a one to one correspondence with all the side effects of these instructions and the c code.

There is no standard function that does this, mostly because it would be very specific to your application.
If you know that the integers in tab32 will be small enough to fit in a uint16_t, the code in your question is probably the best you can get (the compiler will do the rest if it can optimize something).

Well if you don't need to modify the data you can use a pointer to uint16_t on the 32 bits array. It assume that the bare memory make sense as an array of 16 bits unsigned int.
edit: on hold, something is not clear in the question

Using memcpy will be the fastest way in my opinion. memcpy's are optimized separately for each architecture so you should be good.
On the other hand, since registers are 32bit in ARM, and 16bit values are zero/sign extended to 32bit in the back end anyway. So, I think, it would be more efficient to leave them as 32bit arrays and not copy the data into 16bit arrays (You should actually measure to make the correct decision).
There is one method which can save you size and improve performance (hopefully) If you store the incoming value in an int-array but each int would have two of 16bit values.
For example: int[4] would look like this:
----------------------------------------------------------------
| 32bit || 32bit || 32bit || 32bit |
----------------------------------------------------------------
| 16bit | 16bit|| 16bit | 16bit|| 16bit | 16bit|| 16bit | 16bit|
----------------------------------------------------------------
There will be a little preprocessing required(like reading the values as char(bytes) and then, (char*) typecasting on the int-array to store two values in one slot.
The last approach is not guaranteed to give you better performance unless all your algorithm (which you are going to apply on the array) work seamlessly with this layout of elements. Maybe you have to modify the algorithms a little bit to work with this data-structure. For e.g. some bit manipulation algorithms (and, or etc) can be applied to this data-structure without so much work.

Related

How to extract bytes from an SSE2 __m128i structure?

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).
I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.
I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:
greys = aligned_alloc(16, xres * sizeof(int8_t*));
for (uint32_t x = 0; x < xres; x++)
{
greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}
(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?
My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.
Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?
I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.
Nevertheless, I'm grateful for any clarification available.
In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).
Least obvious question first:
once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i
Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
If you want other parts of the vector, not the low element, your options are:
Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the SSE4.1 similar instructions for other element sizes such as _mm_extract_epi64(v, 1); (PEXTRB/W/D/Q) are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, but smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. print a __m128i variable shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2]
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This is compiles the same as a tmp array, and is generally not easier to read.
An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.
Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available it's likely the compiler will actually MOVHPS and integer reload, which usually isn't optimal. (And some compilers like MSVC don't optimize intrinsics.)
Does this turn into a linear, unbroken block in memory?
No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.
Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).
And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
assumed _mm_load_si128 only loads signed bytes
Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.
Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.
To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.
The x86 tag wiki has some good x86 asm / performance links.

Embedded C compile size is bigger with int16's than with int32's. Why?

I am working on embedded C firmware for Freescale Coldfire processors. After writing some code, I began to look at ways to reduce the size of the build. We are limited on space, so this is important for me to consider.
I realized I had several int32's in my code, but I only need int16's for them. To save space I tried replacing the relevant variables with int16's. When I built it, the size of the build went up by about 60 bytes.
I thought it might be how my structs were packed, so I defined how I wanted it packed, but it made no change.
#pragma pack(push, 1)
// Struct here
#pragma pack(pop)
I could kind of see it staying the same, but I can't figure what would cause it to go up. Any thoughts here? What might be causing this?
Edit:
Yes, looks like it was simply generating extra instructions to account for 32-bit being the optimized size for the processor. I should have checked the datasheet first.
This was the extra assembly being generated:
0x00000028 0x3210 move.w (a0),d1
0x0000002A 0x48C1 ext.l d1
; Other instructions between
0x0000002E 0x3028000E move.w 14(a0),d0
0x00000032 0x48C0 ext.l d0**
Your compiler probably emits code for int32's just fine; it's probably the natural int size for your architectiure (is this true? Is sizeof(int)==4?).
I am guessing that the size increase is from three places:
Making alignment work
32-bit ints probably align naturally on the stack and other places,
so typically code would not have to be emitted to make sure "the
stack is 4 byte aligned". If you sprinkle a bunch of 16 bit ints in
your code, it may have to add padding (extra adds to frame pointer
as a fix-up?) Usually one add instruction covers the frame/stack
maintenance, but maybe extra instructions are emitted to guarantee
alignment.
Translating between 16-bit and 32-bit ints
With 32-bit ints, most instructions naturally work. With smaller ints, sometimes the compiler has to emit code that chops/slices up bits so that
it preserves the semantics of the smaller. (Maybe doing an extra AND instruction to mask
off some high-order bits or an OR instruction to set some bits).
Going back and forth to memory
Standard LOADS and STORES are for 32-bit ints (which is probably natural size of your machine). It's possible, that when it has to store only 2 bytes instead of 4, that the architecture has to emit extra instructions to store a non-standard int (either by chopping up the int, using a strange instruction that has a longer encoding, or using bit instructions to chop up the instruction).
These are all guesses. The best way to see is to look at the assembly code and see what's going on!
To save space I tried replacing the relevant variables with int16's.
When I built it, the size of the build went up by about 60 bytes.
This doesn't make much sense to me. Using a smaller data type doesn't necessary translate to fewer instructions. It could reduce memory use when running the software but not necessarily build size.
So, for example, the reason using smaller data types here could be increasing the size of your binaries could be due to the fact that using smaller types requires more instructions or longer instructions. For example, for non-word/dword aligned memory, the compiler may have to use more instructions for unaligned moves. It may have to use special instructions to extract lower/upper words if all the general-purpose registers are larger. In that case, you might also get a slight performance hit in addition to the increased binary size using those smaller types (but less memory use when the code is running).
There may be a number of scenarios and it's specific to both the exact compiler you are using and architecture (the assembly code will reveal the exact cause), but in short, using smaller types for variables does not necessarily mean smaller-sized builds/fewer instructions, and could easily mean the opposite.

Difference between int32_t and int_fast32_t [duplicate]

This question already has answers here:
What is the difference between intXX_t and int_fastXX_t?
(4 answers)
Closed 9 years ago.
What is the difference between the two? I know that int32_t is exactly 32 bits regardless of the environment but, as its name suggests that it's fast, how much faster can int_fast32_t really be compared to int32_t? And if it's significantly faster, then why so?
C is specified in terms of an idealized, abstract machine. But real-world hardware has behavioural characteristics that are not captured by the language standard. The _fast types are type aliases that allow each platform to specify types which are "convenient" for the hardware.
For example, if you had an array of 8-bit integers and wanted to mutate each one individually, this would be rather inefficient on contemporary desktop machines, because their load operations usually want to fill an entire processor register, which is either 32 or 64 bit wide (a "machine word"). So lots of loaded data ends up wasted, and more importantly, you cannot parallelize the loading and storing of two adjacent array elements, because they live in the same machine word and thus need to be load-modify-stored sequentially.
The _fast types are usually as wide as a machine word, if that's feasible. That is, they may be wider than you need and thus consume more memory (and thus are harder to cache!), but your hardware may be able to access them faster. It all depends on the usage pattern, though. (E.g. an array of int_fast8_t would probably be an array of machine words, and a tight loop modifying such an array may well benefit significantly.)
The only way to find out whether it makes any difference is to compare!
int32_t is an integer which is exactly 32bits. It is useful if you want for example to create a struct with an exact memory placement.
int_fast32_t is the "fastest" integer for your current processor that is at last bigger or equal to an int32_t. I don't know if there is really a gain for current processors (x86 or ARM)
But I can at last outline a real case : I used to work with a 32bits PowerPC processor. When accessing misaligned 16bits int16_t, it was inefficient for it has to first realign them in one of its 32bits registers. For non memory-mapped data, since we didn't have memory restrictions, it was more efficient to use int_fast16_t (which were in fact 32bits int).

Is a 32 bit integer on a 16 bit machine possible?

Just wondering if it possible. If yes, are there other ways besides compiler emulation layer?
Thanks
It's processor-dependent. Some processors have special instructions to manipulate register pairs (e.g. the 8-bit AVR instruction set has instructions for 16-bit register pairs). On processors without such native support, the compiler usually emits instructions that work with pairs of registers at a time (this is what is usually done to support 64-bit numbers on 32-bit processors, for example).
Yes, it is possible. Look at the Z80 from the 70s as an example of a 8-bit processor that can manipulate 16-bit values.
Make sure you know what "16-bit processor" means because I have found a lot of people have a misconception about it. Does it mean the opcode size, because some processors have variable width operations? Does it mean the addressing size? Does it mean the smallest/largest value it can natively manipulate?
And as far as at compile-time, sure. Check out arbitrary large number libraries (aka "big nums").

Word and Double Word integers in C

I am trying to implement a simple, moderately efficient bignum library in C. I would like to store digits using the full register size of the system it's compiled on (presumably 32 or 64-bit ints). My understanding is that I can accomplish this using intptr_t. Is this correct? Is there a more semantically appropriate type, i.e. something like intword_t?
I also know that with GCC I can easily do overflow detection on a 32-bit machine by upcasting both arguments to 64-bit ints, which will occupy two registers and take advantage of instructions like IA31 ADC (add with carry). Can I do something similar on a 64-bit machine? Is there a 128-bit type I can upcast to which will compile to use these instructions if they're available? Better yet, is there a standard type that represents twice the register size (like intdoubleptr_t) so this could be done in a machine independent fashion?
Thanks!
Any reason not to use size_t? size_t is 4 bytes on a 32-bit system and 8 bytes on a 64-bit system, and is probably more portable than using WORD_SIZE (I think WORD_SIZE is gcc-specific, no?)
I am not aware of any 128-bit value on 64-bit systems, could be wrong here but haven't come across that type in the kernel or regular user apps.
I'd strongly recommend using the C99 <stdint.h> header. It declares int32_t, int64_t, uint32_t, and uint64_t, which look like what you really want to use.
EDIT: As Alok points out, int_fast32_t, int_fast64_t, etc. are probably what you want to use. The number of bits you specify should be the minimum you need for the math to work, i.e. for the calculation to not "roll over".
The optimization comes from the fact that the CPU doesn't have to waste cycles realigning data, padding the leading bits on a read, and doing a read-modify-write on a write. Truth is, a lot of processors (such as recent x86s) have hardware in the CPU that optimizes these access pretty well (at least the padding and read-modify-write parts), since they're so common and usually only involve transfers between the processor and cache.
So the only thing left for you to do is make sure the accesses are aligned: take sizeof(int_fast32_t) or whatever and use it to make sure your buffer pointers are aligned to that.
Truth is, this may not amount to that much improvement (due to the hardware optimizing transfers at runtime anyway), so writing something and timing it may be the only way to be sure. Also, if you're really crazy about performance, you may need to look at SSE or AltiVec or whatever vectorization tech your processor has, since that will outperform anything you can write that is portable when doing vectored math.

Resources