I am working in a space-limited environment. I collect an array of unsigned 32-bit ints via DMA, but I need to work on them as single-precision floats using the DSP extensions in the MCU. Copying the array is not possible: it takes up almost all of the existing SRAM. Is there a neat way to do this?
[Note] The data values are only 12 bits, so out-of-range problems will not exist.
You can just do it like this:
uint32_t a[N];
float *f = (float *)a;

for (size_t i = 0; i < N; ++i)
{
    f[i] = (float)a[i];   /* read the slot as uint32_t, then overwrite it with its float value */
}
Note that this breaks strict aliasing rules so you should compile with -fno-strict-aliasing or equivalent.
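If you prefer to stay within the strict-aliasing rules, a minimal sketch of an alternative (my variant, not part of the answer above) is to move each value through memcpy, which compilers typically reduce to plain register moves:

#include <stdint.h>
#include <string.h>

/* In-place conversion without type-punned lvalues: each 4-byte slot is read
   as uint32_t via memcpy and then rewritten as float. */
void convert_in_place(uint32_t *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint32_t u;
        float f;
        memcpy(&u, &a[i], sizeof u);   /* read the raw 32-bit value */
        f = (float)u;                  /* exact, since the values fit in 12 bits */
        memcpy(&a[i], &f, sizeof f);   /* overwrite the slot with the float */
    }
}

Later reads of the buffer as float raise the same aliasing question, so -fno-strict-aliasing may still be the pragmatic choice overall.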
Related
I have the following code snippet,
int main()
{
    int loop;
    char * src = 0x20000000;
    char * dest = 0x20000008;

    for (loop = 0; loop < 8; loop++)
        dest[loop] = src[loop];
}
Is this valid code? How can I optimize the logic to reduce looping?
Assuming the compiler doesn't do it automatically, it can be optimized if a couple of assumptions are in place:
Size of char is 1 byte
The CPU instruction set supports 8-byte operations (for non-64-bit platforms the implementation may vary)
Cast (or directly define) the source and destination addresses to unsigned long long*
Perform a direct assignment from source to destination. If the platform instruction set supports 64-bit operations, this should result in a copy of a 64-bit chunk of data from source to destination. E.g., on Intel CPUs it can be done with a single movsq assembly instruction.
int main()
{
    unsigned long long * src  = (unsigned long long *)0x20000000;
    unsigned long long * dest = (unsigned long long *)0x20000008;

    *dest = *src;
    return 0;
}
Is this valid code?
You can use your compiler to find that out. The loop part in particular is valid code, which I suppose is what you're really asking about.
How can I optimize the logic to reduce looping?
Turn on optimization in your compiler; it will take care of the rest. There's no way to improve on that code from a performance perspective, though you could use memcpy() to make the code more concise and easier to read.
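For example (a sketch; the hard-coded addresses are just the ones from the question, and whether they are valid to dereference depends on the target):

#include <string.h>

int main(void)
{
    /* Copy 8 bytes from 0x20000000 to 0x20000008; the compiler is free to
       turn this into a single 64-bit (or wider) move. */
    memcpy((void *)0x20000008, (const void *)0x20000000, 8);
    return 0;
}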
TL;DR
For ARM NEON intrinsics, how do you feed a 128-bit variable of type uint8x16_t into a function expecting uint16x8_t?
EXTENDED VERSION
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor of 2x. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code looks like this:
for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;

    for (int x = 0; x < cols; x += 2) {
        *p_out = min(min(p_in[0], p_in[1]), min(p_in[inStride], p_in[inStride + 1]));
        p_out++;
        p_in += 2;
    }
}
Where both rows and cols are multiples of 2. I call "stride" the step in bytes that it takes to go from one pixel to the pixel immediately below it in the image.
Now I want to vectorize this. The idea is:
1. take 2 consecutive rows of pixels
2. load 16 bytes into a from the top row, and the 16 bytes immediately below into b
3. compute the minimum byte by byte between a and b; store the result in a
4. create a copy of a shifted right by 1 byte (8 bits); store it in b
5. compute the minimum byte by byte between a and b; store the result in a
6. store every second byte of a in the output image (discarding half of the bytes)
I want to write this using NEON intrinsics. The good news is that for each step there exists an intrinsic that matches it.
For example, at point 3 one can use (from here):
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And at point 4 one can use one of the following using a shift of 8 bits (from here):
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That's because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13, and 15, since they will be discarded from the final result anyway. (The correctness of this has been verified and it's not the point of the question.)
HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, while I have been told that the problem cannot be reliably addressed using a union (Edit: although that answer refers to C++, and in fact in C type punning IS allowed), nor by casting pointers, because this would break the strict aliasing rule.
What is the way to combine different data types while using ARM Neon intrinsics?
For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.
So, assuming a and b are declared as:
uint8x16_t a, b;
Your point 4 can be written as(*):
b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );
However, note that unfortunately this does not address data types using an array of vector types; see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128-bit integer data type (namely __m128i):
__m128i b = _mm_srli_si128(a,1);
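Putting the answer together with the steps from the question, a minimal sketch of the vectorized inner loop could look like the following. This is my illustration, not code from the original answer: it assumes cols is a multiple of 16, uses the non-rounding shift vshrq_n_u16, and implements step 6 with a narrowing move (vmovn_u16), which keeps the low byte of each 16-bit lane.

#include <arm_neon.h>
#include <stdint.h>

void downscale2x2_min(const uint8_t *inBuffer, int inStride,
                      uint8_t *outBuffer, int outStride,
                      int rows, int cols)
{
    for (int y = 0; y < rows; y += 2) {
        const uint8_t *p_in  = inBuffer  + y * inStride;
        uint8_t       *p_out = outBuffer + (y / 2) * outStride;

        for (int x = 0; x < cols; x += 16) {
            uint8x16_t top = vld1q_u8(p_in + x);                  /* step 2 */
            uint8x16_t bot = vld1q_u8(p_in + x + inStride);
            uint8x16_t a   = vminq_u8(top, bot);                  /* step 3 */

            /* step 4: shift each 16-bit lane right by 8 bits, reinterpreting as needed */
            uint8x16_t b = vreinterpretq_u8_u16(
                               vshrq_n_u16(vreinterpretq_u16_u8(a), 8));

            a = vminq_u8(a, b);                                   /* step 5 */

            /* step 6: keep the low byte of each 16-bit lane (every second byte) */
            vst1_u8(p_out + x / 2, vmovn_u16(vreinterpretq_u16_u8(a)));
        }
    }
}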
I am trying to exploit the 512-bit SIMD offered by KNC (Xeon Phi) to improve the performance of the C code below using Intel intrinsics. However, my intrinsics code runs slower than the auto-vectorized code.
C Code
int64_t match = 0;
int *myArray __attribute__((align(64)));
myArray = (int *) malloc(sizeof(int) * SIZE); // SIZE is array size taken from user
radomize(myArray);                            // to fill some random data
int searchVal = 24;

#pragma vector always
for (int i = 0; i < SIZE; i++) {
    if (myArray[i] == searchVal) match++;
}
return match;
Intrinsic embedded code:
In the code below I first load the array and compare it with the search key. The compare intrinsic returns a 16-bit mask value that is reduced using _mm512_mask_reduce_add_epi32().
register int64_t match = 0;
int *myArray __attribute__((align(64)));
myArray = (int *) malloc(sizeof(int) * SIZE); // SIZE is array size taken from user

const int values[16] = {
    1, 1, 1, 1,
    1, 1, 1, 1,
    1, 1, 1, 1,
    1, 1, 1, 1,
};

__m512i const flag = _mm512_load_epi32((void *) values);
__mmask16 countMask;
__m512i searchVal = _mm512_set1_epi32(16);
__m512i kV = _mm512_setzero_epi32();

for (int i = 0; i < SIZE; i += 16)
{
    // kV = _mm512_setzero_epi32();
    kV = _mm512_loadunpacklo_epi32(kV, (void *)(&myArray[i]));
    kV = _mm512_loadunpackhi_epi32(kV, (void *)(&myArray[i + 16]));
    countMask = _mm512_cmpeq_epi32_mask(kV, searchVal);
    match += _mm512_mask_reduce_add_epi32(countMask, flag);
}
return match;
I believe I have somehow introduced extra cycles in this code, and hence it runs slower than the auto-vectorized code. Unlike 128-bit SIMD, which directly returns the result of the compare in a 128-bit register, the 512-bit SIMD returns the result in a mask register, which adds complexity to my code. Am I missing something here? There must be a way to compare and keep a count of successful searches directly, rather than going through mask operations.
Finally, please suggest ways to increase the performance of this code using intrinsics. I believe I can squeeze out more performance with intrinsics; that was at least true for 128-bit SIMD, where using intrinsics gained me 25% in performance.
I suggest the following optimizations (a sketch combining the first two follows this list):
Use prefetching. Your code performs very little computation and is almost surely bandwidth-bound. The Xeon Phi has hardware prefetching only for the L2 cache, so for optimal performance you need to insert prefetch instructions manually.
Use the aligned load _mm512_load_epi32, as hinted by @PaulR. Use the memalign function instead of malloc to guarantee that the array is really aligned on 64 bytes. And in case you ever need misaligned loads, use _mm512_undefined_epi32() as the source for the first misaligned load, as it breaks the dependency on kV (in your current code) and lets the compiler do additional optimizations.
Unroll the loop by 2, or use at least two threads to hide instruction latency.
Avoid using an int variable as an index; unsigned int, size_t, or ssize_t are better options.
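Combining the first two points, a minimal sketch of what the loop could look like (my code, untested; the prefetch distances are placeholders to tune, and I assume the KNC toolchain's memalign and _mm_prefetch are available):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   /* KNC intrinsics (e.g. icc -mmic) */

#define PF_L2 (16 * 16)  /* prefetch this many elements ahead into L2 (placeholder) */
#define PF_L1 (16 * 4)   /* and this many ahead into L1 (placeholder) */

int64_t count_matches(const int *myArray, size_t SIZE, int searchVal)
{
    const __m512i needle = _mm512_set1_epi32(searchVal);
    const __m512i ones   = _mm512_set1_epi32(1);
    int64_t match = 0;

    for (size_t i = 0; i < SIZE; i += 16) {
        _mm_prefetch((const char *)&myArray[i + PF_L2], _MM_HINT_T1);
        _mm_prefetch((const char *)&myArray[i + PF_L1], _MM_HINT_T0);

        __m512i kV = _mm512_load_epi32((const void *)&myArray[i]);  /* 64-byte aligned load */
        __mmask16 m = _mm512_cmpeq_epi32_mask(kV, needle);
        match += _mm512_mask_reduce_add_epi32(m, ones);
    }
    return match;
}

The array itself would then be allocated with something like myArray = memalign(64, SIZE * sizeof(int)); so that the aligned load is actually legal.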
I'm reading (in binary format) a file of unsigned 8-bit integers, which I then need to convert to an array of floats. Normally I'd just do something like the following:
uint8_t *s1_tmp = (uint8_t *)malloc(sizeof(uint8_t) * num_elements);
float *s1 = (float *)malloc(sizeof(float) * num_elements);

fread(s1_tmp, sizeof(uint8_t), num_elements, file_id);

for (int i = 0; i < num_elements; i++) {
    s1[i] = s1_tmp[i];
}
free(s1_tmp);
Uninspired to be sure, but it works. However, currently num_elements is around 2.7 million, so the process is super slow and IMO wasteful.
Is there a better way to read in the 8-bit integers as floats or convert the uint8_t array into a float array?
Firstly, this is going to be I/O-bound from reading the data in. Secondly, it's going to be memory-bound. You'll get much better cache performance if you interleave the conversion with the reading.
Pick some reasonable buffer size that's large enough for good I/O performance but small enough to fit in your cache, maybe 8-32 KB or so. Read in that much data, convert, and repeat.
For example:
#define BUFSIZE 16384
#define MIN(a, b) ((a) < (b) ? (a) : (b))

uint8_t *buffer = malloc(BUFSIZE);
float *s1 = malloc(num_elements * sizeof(float));
int total_read = 0;
int n;

while (total_read < num_elements && (n = fread(buffer, 1, BUFSIZE, file_id)) > 0)
{
    n = MIN(n, num_elements - total_read);   /* don't convert past the requested count */
    for (int i = 0; i < n; i++)
        s1[total_read + i] = (float)buffer[i];
    total_read += n;
}
free(buffer);
You might also see improved performance by using SIMD operations to convert multiple items at once. However, the total performance will still be bottlenecked by the I/O from fread, so how much improvement SIMD buys you is questionable.
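For illustration, on an x86 target with SSE4.1 the conversion of four bytes at a time could look like this (my sketch, not benchmarked; it assumes n is a multiple of 4):

#include <stdint.h>
#include <string.h>
#include <smmintrin.h>   /* SSE4.1: _mm_cvtepu8_epi32 */

static void u8_to_float_sse41(const uint8_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        uint32_t four;
        memcpy(&four, src + i, sizeof four);           /* load 4 bytes */
        __m128i b = _mm_cvtsi32_si128((int)four);      /* place them in an XMM register */
        __m128i w = _mm_cvtepu8_epi32(b);              /* zero-extend to four 32-bit ints */
        _mm_storeu_ps(dst + i, _mm_cvtepi32_ps(w));    /* convert to floats and store */
    }
}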
Since you're converting a large number of uint8_t values, it's entirely possible you could get some improved performance by using a lookup table instead of doing the integer-to-floating-point conversion. You'd only need a lookup table of 256 float values (1 KB), which easily fits in cache. I don't know whether that would be faster or not, so you should definitely profile the code to figure out what the best option is.
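A minimal sketch of the lookup-table idea (again my code, to be profiled against the straight conversion):

#include <stdint.h>

/* 256-entry table: index by byte value, get the corresponding float. */
static float u8_to_f32[256];

static void init_u8_table(void)
{
    for (int v = 0; v < 256; v++)
        u8_to_f32[v] = (float)v;
}

/* Drop-in replacement for the inner conversion loop above. */
static void convert_with_table(const uint8_t *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = u8_to_f32[in[i]];
}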
I need to convert a 20-digit decimal number to binary using C. How do I go about it? Really, I am finding it hard. What buffer should I create? It will be very large; even the calculator can't handle converting 20 digits to binary.
I need suggestions, links and possibly sample code.
Thanks.
Do you need to convert a decimal string to a binary string or to a value?
Rule of thumb: 10^3 ~= 2^10, therefore 10^20 ~= 2^67 > 64 bits (67 bits, to be accurate).
==> A 64-bit integer will not be enough. You can use a structure with two 64-bit integers (long long in C), or even an 8-bit byte for the upper part and a 64-bit integer for the lower part.
Make sure the lower part is unsigned.
You will need to write code that checks for overflow of the lower part and increases the upper part when that happens. You will also need the long-division algorithm once you cross the 64-bit line.
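As a sketch of the carry-into-the-upper-part idea, here is one way to parse the digits into a value wider than 64 bits. I use three 32-bit limbs with a 64-bit temporary instead of the 8+64 split described above, because it makes the carry handling trivial; the function name and layout are my own:

#include <stdint.h>
#include <string.h>

/* Parse a decimal string (up to ~28 digits fits in 96 bits) into three
   32-bit limbs, least significant limb first (limb[0] = lowest 32 bits). */
static void parse_decimal(const char *s, uint32_t limb[3])
{
    memset(limb, 0, 3 * sizeof limb[0]);
    for (; *s; ++s) {
        uint64_t carry = (uint64_t)(*s - '0');          /* next decimal digit */
        for (int i = 0; i < 3; ++i) {
            uint64_t t = (uint64_t)limb[i] * 10u + carry;
            limb[i] = (uint32_t)t;                       /* low 32 bits stay here */
            carry   = t >> 32;                           /* overflow moves up */
        }
    }
}

The binary digits then fall out of the limbs directly, bit by bit, as in the answer further down.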
What about using a library for extended-precision arithmetic? Take a look at http://gmplib.org/
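For example, with GMP the whole conversion is a couple of calls (a sketch; link with -lgmp, and the 20-digit literal is the example string used in the answer below):

#include <stdio.h>
#include <stdlib.h>
#include <gmp.h>

int main(void)
{
    mpz_t n;
    mpz_init_set_str(n, "23563344324467434533", 10);  /* parse the decimal string */

    char *bin = mpz_get_str(NULL, 2, n);              /* base-2 representation */
    printf("%s\n", bin);

    free(bin);   /* fine when GMP uses the default allocator */
    mpz_clear(n);
    return 0;
}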
I don't know if you are trying to convert a string of numerical characters into a really big int, or a really big int into a string of 1s and 0s... but in general, you'll be doing something like this:
for (i = 0; i < digits; i++)
{
    bit[i] = (big_number >> i) & 1;
    // or, for the other way around
    // big_number |= (bit[i] << i);
}
the main problem is that there is no built-in type that can store "big_number". So you'll probably be doing it more like this...
uint8_t big_number[10]; // the big number is stored in 10 bytes
                        // (uint8_t is just "unsigned char")

for (block = 0; block < 10; block++)
{
    for (i = 0; i < 8; i++)
    {
        bit[block*8 + i] = (big_number[block] >> i) & 1;
    }
}
[edit]
To read a string of numerical characters into an int (without using scanf, atoi, etc.), I would do something like this:
// Supposing I have something like char token[] = "23563344324467434533";
int n = strlen(token); // number of digits

big_number = 0;
for (int i = 0; i < n; i++)
{
    big_number += (token[i] - '0') * pow(10, n - i - 1);
}
That will work for reading the number, but again, the problem is that there is no built-in type to store big_number. You could use a float or a double, which would get the magnitude of the number right, but the last few digits would be rounded off. If you need perfect precision, you will have to use an arbitrary-precision integer. A Google search turns up a few possible libraries for that, but I don't have much personal experience with them, so I won't make a recommendation.
Really though, the data type you use depends on what you want to do with the data afterwards. Maybe you need an arbitrary-precision integer, or maybe a double would be exactly what you need; or maybe you can write your own basic data type using the technique I outlined with the blocks of uint8_t, or maybe you're better off just leaving it as a string!