Is there a standard, strided version of memcpy?

Is there a standard, strided version of memcpy? - c

I have a column vector A which is 10 elements long. I have a matrix B which is 10 by 10. The memory storage for B is column major. I would like to overwrite the first row in B with the column vector A.
Clearly, I can do:
for ( int i=0; i < 10; i++ )
{
B[0 + 10 * i] = A[i];
}
where I've left the zero in 0 + 10 * i to highlight that B uses column-major storage (zero is the row-index).
After some shenanigans in CUDA-land tonight, I had a thought that there might be a CPU function to perform a strided memcpy?? I guess at a low-level, performance would depend on the existence of a strided load/store instruction, which I don't recall there being in x86 assembly?

Short answer: The code you have written is as fast as it's going to get.
Long answer: The memcpy function is written using some complicated intrinsics or assembly because it operates on memory operands that have arbitrary size and alignment. If you are overwriting a column of a matrix, then your operands will have natural alignment, and you won't need to resort to the same tricks to get decent speed.

Related

Maximizing the performance and efficiency of triangularizing a 24x24 matrix in C and then in MIPS assembly

As of recently an interest within the realm of computer architecture and performance has been sparked in me. With that said, I have been picking up an "easier" assembly language to really try and learn how stuff "works under the hood". Namely MIPS assembly. I feel comfortable enough to try and experiment with some more advanced stuff and as such I have decided to combine programming with my interest in mathematics.
My goal is simple, given a 24x24 (I don't care about any other size) matrix A, I want to write an algorithm that as efficiently as possible finds the upper triangular form of the matrix. With efficiently I mean that I want to eventually end up in a state where I use the processor's that I am using resources the best I can. High cache hit rate, efficient usage of memory (locality of reference principle etc.), performance as in time it takes to run the solution, etc.
Eventually my goal is to transform the C solution to MIPS-assembly and tailor it to fit the memory subsystem of the processor that I will be trying to run my algorithm on. Regarding the processor I will have different options to play around with when it comes to caches, write buffers and memory in the sense that I can play around with different cache sizes, block sizes, associativity levels, memory access times etc. Performance in this case will be measured in the time it takes to triangularize a 24x24 matrix.
To begin, I need to actually write some high level code and actually solve the problem there before diving into MIPS assembly. I have "looked around" and eventually came up with this seemingly standard solution. It isn't necessarily super fast, neither do I think it is optimal for triangularizing 24x24 matrices. Can I do better?
void triangularize(float **A, int N)
{
int i, j, k;
// Loop over the diagonal elements
for (k = 0; k < N; k++)
{
// Loop over all the elements in the pivot row and right of the pivot ELEMENT
for (j = k + 1; j < N; j++)
{
// divide by the pivot element
A[k][j] = A[k][j] / A[k][k];
}
// Set the pivot elements
A[k][k] = 1.0;
// Loop over all elements below the pivot right an right of the pivot COLUMN
for (i = k + 1; i < N; i++)
{
for (j = k + 1; j < N; j++)
{
A[i][j] = A[i][j] - A[i][k] * A[k][j];
}
A[i][k] = 0.0;
}
}
}
Furthermore, what should be my next steps when trying to convert the C code to MIPS assembly with respect to maximizing performance and minimizing cost (cache hit rates, IO costs when dealing with memory etc.) to get a lightning fast and efficient solution?

First of all, encoding a matrix as a jagged array (ie. float**) is generally not efficient as it cause unnecessary expensive indirections and the array may not be contiguous in memory resulting in more cache misses or even cache trashing in pathological cases. It is certainly better to copy the matrix in a contiguous flatten array. Please consider storing your matrices as flatten arrays that are generally more efficient (especially on MIPS). Flatten array can be indexed using something like array[i*24+j] instead of array[i][j].
Moreover, if you do not care about matrices other than 24x24 ones, then you can write a specialized code for 24x24 matrices. This help compilers to generate a more efficient assembly code (typically by unrolling loops and using more efficient instructions like multiplication by a constant).
Additionally, divisions are generally expensive, especially on embedded MIPS processors. Thus, you can replace divisions by multiplications with the inverse. For example:
float inv = 1.0f / A[k][k];
for (j = k + 1; j < N; j++)
A[k][j] *= inv;
Note that the result might be slightly different due to floating-point rounding. You can use the -ffast-math compiler flag so to help it generating such optimisation if you know that special values like NaN or Inf do not appear in the matrix.
Moreover, it may be faster to unroll the loop manually since not all compilers do that (properly). That being said, the benefit of loop unrolling is very dependent of the target processor (unspecified here). Without more information, it is very hard to know if this is useful. For example, some processor can execute multiple floating-point operation per cycles while some other cannot even do that natively (ie. no hardware FP unit): they are somehow emulated with many instruction which is very expensive (compilers like GCC do function calls for basic operations like addition/subtraction on such processors). If there is no hardware FP unit, then it might be faster to use fixed precision.
Finally, some MIPS processors have a 128-bit SIMD unit. Using it should significantly speed up the execution. Compilers should be able to mostly auto-vectorize your code but you need to tell them if your target processor support it (see the -march flag for GCC/Clang). For a fixed-size matrix, manual vectorization often result in a faster execution (than auto-vectorisation) assuming you write an efficient code.

Matrix Multiplication of size 100*100 using SSE Intrinsics

int MAX_DIM = 100;
float a[MAX_DIM][MAX_DIM]__attribute__ ((aligned(16)));
float b[MAX_DIM][MAX_DIM]__attribute__ ((aligned(16)));
float d[MAX_DIM][MAX_DIM]__attribute__ ((aligned(16)));
/*
* I fill these arrays with some values
*/
for(int i=0;i<MAX_DIM;i+=1){
for(int j=0;j<MAX_DIM;j+=4){
for(int k=0;k<MAX_DIM;k+=4){
__m128 result = _mm_load_ps(&d[i][j]);
__m128 a_line = _mm_load_ps(&a[i][k]);
__m128 b_line0 = _mm_load_ps(&b[k][j+0]);
__m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
__m128 b_line2 = _mm_loadu_ps(&b[k][j+2]);
__m128 b_line3 = _mm_loadu_ps(&b[k][j+3]);
result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b_line0));
result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b_line1));
result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b_line2));
result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b_line3));
_mm_store_ps(&d[i][j],result);
}
}
}
the above code I made to make matrix multiplication using SSE. the code runs as flows I take 4 elements from row from a multiply it by 4 elements from a column from b and move to the next 4 elements in the row of a and next 4 elements in column b
I get an error Segmentation fault (core dumped) I don't really know why
I use gcc 5.4.0 on ubuntu 16.04.5
Edit :
The segmentation fault was solved by _mm_loadu_ps
Also there is something wrong with logic i will be greatfull if someone helps me to find it

The segmentation fault was solved by _mm_loadu_ps Also there is something wrong with logic...
You're loading 4 overlapping windows on b[k][j+0..7]. (This is why you needed loadu).
Perhaps you meant to load b[k][j+0], +4, +8, +12? If so, you should align b by 64, so all four loads come from the same cache line (for performance). Strided access is not great, but using all 64 bytes of every cache line you touch is a lot better than getting row-major vs. column-major totally wrong in scalar code with no blocking.
I take 4 elements from row from a multiply it by 4 elements from a column from b
I'm not sure your text description describes your code.
Unless you've already transposed b, you can't load multiple values from the same column with a SIMD load, because they aren't contiguous in memory.
C multidimensional arrays are "row major": the last index is the one that varies most quickly when moving to the next higher memory address. Did you think that _mm_loadu_ps(&b[k][j+1]) was going to give you b[k+0..3][j+1]? If so, this is a duplicate of SSE matrix-matrix multiplication (That question is using 32-bit integer, not 32-bit float, but same layout problem. See that for a working loop structure.)
To debug this, put a simple pattern of values into b[]. Like
#include <stdalign.>
alignas(64) float b[MAX_DIM][MAX_DIM] = {
0000, 0001, 0002, 0003, 0004, ...,
0100, 0101, 0102, ...,
0200, 0201, 0202, ...,
};
// i.e. for (...) b[i][j] = 100 * i + j;
Then when you step through your code in the debugger, you can see what values end up in your vectors.
For your a[][] values, maybe use 90000.0 + 100 * i + j so if you're looking at registers (instead of C variables) you can still tell which values are a and which are b.
Related:
Ulrich Drepper's What Every Programmer Should Know About Memory shows an optimized matmul with cache-blocking with SSE instrinsics for double-precision. Should be straightforward to adapt for float.
How does BLAS get such extreme performance? (You might want to just use an optimized matmul library; tuning matmul for optimal cache-blocking is non-trivial but important)
Matrix Multiplication with blocks
Poor maths performance in C vs Python/numpy has some links to other questions
how to optimize matrix multiplication (matmul) code to run fast on a single processor core

ARM Neon in C: How to combine different 128bit data types while using intrinsics?

TLTR
For arm intrinsics, how do you feed a 128bit variable of type uint8x16_t into a function expecting uint16x8_t?
EXTENDED VERSION
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor 2x. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code will look like this:
for (int y = 0; y < rows; y += 2) {
uint8_t* p_out = outBuffer + (y / 2) * outStride;
uint8_t* p_in = inBuffer + y * inStride;
for (int x = 0; x < cols; x += 2) {
*p_out = min(min(p_in[0],p_in[1]),min(p_in[inStride],p_in[inStride + 1]) );
p_out++;
p_in+=2;
}
}
Where both rows and cols are multiple of 2. I call "stride" the step in bytes that takes to go from one pixel to the pixel immediately below in the image.
Now I want to vectorize this. The idea is:
take 2 consecutive rows of pixels
load 16 bytes in a from the top row, and load the 16 bytes immediately below in b
compute the minimum byte by byte between a and b. Store in a.
create a copy of a shifting it right by 1 byte (8 bits). Store it in b.
compute the minimum byte by byte between a and b. Store in a.
store every second byte of a in the output image (discards half of the bytes)
I want to write this using Neon intrinsics. The good news is, for each step there exists an intrinsic that match it.
For example, at point 3 one can use (from here):
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And at point 4 one can use one of the following using a shift of 8 bits (from here):
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That's because I do not care what happens to byte 1,3,5,7,9,11,13,15 because anyway they will be discarded from the final result. (The correctness of this has been verified and it's not the point of the question.)
HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, while I have been told that the problem cannot be reliably addressed using union (Edit: although that answer refer to C++, and in fact in C type punning IS allowed), nor by using pointers to cast, because this will break the strict aliasing rule.
What is the way to combine different data types while using ARM Neon intrinsics?

For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a
different type, without changing its value. A set of intrinsics is
provided to perform this type of conversion.
So, assuming a and b are declared as:
uint8x16_t a, b;
Your point 4 can be written as(*):
b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );
However, note that unfortunately this does not address data types using an array of vector types, see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
(*) It should be said, this is much more cumbersome of the equivalent (in this specific context) SSE code, as SSE has only one 128 bit integer data type (namely __m128i):
__m128i b = _mm_srli_si128(a,1);

optimizing a line of C code for 8 bit processor

I'm working on a 8bit processor and have written code in a C compiler, now more than 140 lines of code are taking just 1200 bytes and this single line is taking more than 200 bytes of ROM space. eeprom_read() is a function, there should be a problem with this 1000 and 100 and 10 multiplication.
romAddr = eeprom_read(146)*1000 + eeprom_read(147)*100 +
eeprom_read(148)*10 + eeprom_read(149);
Processor is 8-bit and data type of romAddr is int. Is there any way to write this line in a more optimized way?

It's possible that the thing that uses the most space is the use of multiplication. If your processor lacks an instruction to do multiplication, the compiler is forced to use software to do it step by step, which can require quite a bit of code.
It's hard to say, since you don't specify anything about your target processor (or which compiler you're using).
One way might be to somehow try to reduce inlining, so the code to multiply by 10 (which is used in all four terms) can be re-used.
To know if this is the case at all, the machine code must be inspected. By the way, the use of decimal constants for an address calculation is really odd.

Sometimes the multiplication can be compiled into a sequence of additions, yes. You can optimize it say by using left shift operator.
A*1000 = A*512 + A*256 + A*128 + A*64 + A*32 + A*8
Or the same thing:
A<<9 + A<<8 + A<<7 + A<<6 + A<<5 + A<<3
This still is way longer then a single "multiply" instruction, but your processor apparently doesn't have it anyway, so this might be the next best thing.

You're concerned about space, not time, right?
You've got four function calls, with an integer argument being passed to each one, followed by a multiplication by a constant, followed by adding.
Just as a first guess, that could be
load integer constant into register (6 bytes)
push register (2 bytes,
call eeprom_read (6 bytes)
adjust stack (4 bytes)
load integer multiplier into register (6 bytes)
push both registers (4 bytes),
call multiplication routine (6 bytes)
adjust stack (4 bytes)
load temporary sum into a register (6 bytes)
add to that register the result of the multiplication (2 bytes)
store back in the temporary sum (6 bytes).
Let's see, 6+2+6+4+6+4+6+4+6+2+6= about 52 bytes per call to eeprom_read.
The last call would be shorter because it doesn't do the multiply.
I would try calling eeprom_read not with arguments like 146 but with (unsigned char)146, and multiplying not by 1000 but by (unsigned short)1000.
That way, you might be able to tease the compiler into using shorter instructions, and possibly using a multiply instruction rather than a multiply function call.
Also, the call to eeprom_read might be macro'ed into a direct memory fetch, saving the pushing of the argument, the calling of the function, and the stack adjustment.
Another trick could be to store each one of the four products in a local variable, and add them all together at the end. That could generate less code.
All these possibilities would also make it faster, as well as smaller, though you probably don't need to care about that.
Another possibility for saving space could be to use a loop, like this:
static unsigned short powerOf10[] = {1000, 100, 10, 1};
unsigned short i;
romAddr = 0;
for (i = 146; i < 150; i++){
romAddr += powerOf10[i-146] * eeprom_read(i);
}
which should save space by having the call and the multiply only once, plus the looping instructions, rather than four copies.
In any case, get handy with the assembler language that the compiler generates.

It depends very, very much on the compiler, but I would suggest that you at least simplify the multiplication this way:
romAddr = ((eeprom_read(146)*10 + eeprom_read(147))*10 +
eeprom_read(148))*10 + eeprom_read(149);
You could put this in a loop:
uint8_t i = 146;
romAddr = eeprom_read(i);
for (i = 147; i < 150; i++)
romAddr = romAddr * 10 + eeprom_read(i);
Hopefully the compiler should recognise how much simpler it is to multiply a 16-bit value by ten, compared with separately implementing multiplications by 1000 and 100.
I'm not completely comfortable relying on the compiler to deal with the loop effectively, though.
Maybe:
uint8_t hi, lo;
hi = (uint8_t)eeprom_read(146) * (uint8_t)10 + (uint8_t)eeprom_read(147);
lo = (uint8_t)eeprom_read(148) * (uint8_t)10 + (uint8_t)eeprom_read(149);
romAddr = hi * (uint8_t)100 + lo;
All of these are untested.

How to map a long integer number to a N-dimensional vector of smaller integers (and fast inverse)?

Given a N-dimensional vector of small integers is there any simple way to map it with one-to-one correspondence to a large integer number?
Say, we have N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "Yes, we can". But the question is: "What is the fastest way to do this and its inverse operation?"
Will this new 1-dimensional space possess some very special useful properties?

For the above example you have 3 * 32 = 96 bits of information, so without any a priori knowledge you need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3, values will always fit within, say, 16 bits each, then you can pack them all into a 48 bit integer.
In either case the technique is very simple you just use shift, mask and bitwise or operations to pack/unpack the values.

Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit number) like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type, otherwise you can't represent the "packed" vector as a single number.

If you have sets Si, i=1..n of size Ci = |Si|, then the cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1+C1*(e2 + C2*(e3 + C3*(...Cn*en...))).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k where k=n if n<=m and k=n+m if n>m--you can draw this as a picture of filling in a square like so:
1 2 5 | draw along the edge of the square this way
4 3 6 v
8 7
if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)

I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different integers which are mutually prime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but hope it helps.

To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a generalisation of this, for cases where the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise-or and the modulus with a bitwise-and).

There is some totally non portable ways to make this real fast using packed unions and direct accesses to memory. That you really need this kind of speed is suspicious. Methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors like GPU for wich vector support is optimized (parallel).
This naive storage does not possess any usefull property than I can foresee, except you can perform some computations (add, sub, logical bitwise operators) on the three coordinates at once as long as you use positive integers only and you don't overflow for add and sub.
You'd better be quite sure you won't overflow (or won't go negative for sub) or the vector will become garbage.

#include <stdint.h> // for uint8_t
long x;
uint8_t * p = &x;
or
union X {
long L;
uint8_t A[sizeof(long)/sizeof(uint8_t)];
};
works if you don't care about the endian. In my experience compilers generate better code with the union because it doesn't set of their "you took the address of this, so I must keep it in RAM" rules as quick. These rules will get set off if you try to index the array with stuff that the compiler can't optimize away.
If you do care about the endian then you need to mask and shift.

I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.