I wrote this code to find the highest temperature pixel in a thermal image. I also need to know the coordinates of the pixel in the image.
void _findMax(uint16_t* image, int sz, sPixelData* returnPixel)
{
    int temp = 0;
    uint16_t max = image[0];

    for(int i = 1; i < sz; i++)
    {
        if(max < image[i])
        {
            max = image[i];
            //temp = i;
        }
    }

    returnPixel->temperature = image[temp];
    //returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
    //returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
With the three lines commented out the function executes in about 2ms. With the lines uncommented it takes about 35ms to execute the function.
This seems very excessive seeing as the divide and modulus are only performed once after the loop.
Any suggestions on how to speed this up?
Or an explanation of why it takes so much longer to execute than the version with the divide and modulus left out?
This is executing on an ARM A9 processor running Linux.
The compiler I'm using is the ARM v8 32-bit Linux GCC compiler.
I'm compiling with -O3 and the following options: -march=armv7-a+neon -mcpu=cortex-a9 -mfpu=neon-fp16 -ftree-vectorize.
Your code is flawed.
Since temp is always 0, the compiler generates machine code that just executes returnPixel->temperature = image[0]; the max computed in the loop is never used, so the loop is dead code and gets removed entirely, which is why it finishes in no time. There is nothing odd here.
You should modify the line to: returnPixel->temperature = max;
You could boost the performance significantly by utilizing NEON, but that's another problem.
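To also get the coordinates back, the commented-out index tracking has to come back as well. Here is a minimal corrected sketch, assuming sPixelData and IMAGE_HORIZONTAL_SIZE are defined as in your project:

void _findMax(uint16_t* image, int sz, sPixelData* returnPixel)
{
    int maxIndex = 0;
    uint16_t max = image[0];

    for(int i = 1; i < sz; i++)
    {
        if(max < image[i])
        {
            max = image[i];
            maxIndex = i;   /* remember where the hottest pixel is */
        }
    }

    returnPixel->temperature = max;
    returnPixel->x_location = maxIndex % IMAGE_HORIZONTAL_SIZE;
    returnPixel->y_location = maxIndex / IMAGE_HORIZONTAL_SIZE;
}

Expect this to take roughly the 35 ms you already measured, because the loop is now doing real work that the compiler can no longer discard.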
I need to do a simple multiply-accumulate of two signed 8-bit arrays.
This routine runs every millisecond on an ARM7 embedded device. I am trying to speed it up a bit. I have already tried optimizing and enabling vector ops:
-mtune=cortex-a15.cortex-a7 -mfpu=neon-vfpv4 -ftree-vectorize -ffast-math -mfloat-abi=hard
This helped, but I am still running close to the edge.
This is the C code:
for(i = 4095; i >= 0; --i)
{
    accum += arr1[i]*arr2[i];
}
I am trying to use NEON intrinsics. This loop runs ~5 times faster, but I get different results. I am pretty sure I am not properly retrieving the accumulation, or it rolls over before I do. Any help/pointers are greatly appreciated. Any detailed docs would also be helpful.
for(int i = 256; i > 0; --i)
{
int8x16_t vec16a = vld1q_s8(&arr1[index]);
int8x16_t vec16b = vld1q_s8(&arr2[index]);
vec16res = vmlaq_s8(vec16res, vec16a, vec16b);
index+=16;
}
EDIT: posting the solution.
Thanks for the tips from everyone!
I dropped down to 8x8 operations and have a fast solution.
Using the code below I achieved a "fast enough" time. Not as fast as the 128-bit version, but good enough.
I added __builtin_prefetch() for the data and did a 10-pass average.
NEON is substantially faster.
$ ./test 10
original code time ~ 30392nS
optimized C time ~ 8458nS
NEON elapsed time ~ 3199nS
int32_t sum = 0;
int16x8_t vecSum = vdupq_n_s16(0);
int8x8_t vec8a;
int8x8_t vec8b;
int32x4_t sum32x4;
int32x2_t sum32x2;

#pragma unroll
for (i = 512; i > 0; --i)
{
    vec8a = vld1_s8(&A[index]);
    vec8b = vld1_s8(&B[index]);
    vecSum = vmlal_s8(vecSum, vec8a, vec8b);
    index += 8;
}

sum32x4 = vaddl_s16(vget_high_s16(vecSum), vget_low_s16(vecSum));
sum32x2 = vadd_s32(vget_high_s32(sum32x4), vget_low_s32(sum32x4));
sum += vget_lane_s32(vpadd_s32(sum32x2, sum32x2), 0);
Your issue is likely overflow, so you'll need to lengthen (widen) the lanes when you do your multiply-accumulate.
As you're on ARMv7, you'll want vmlal_s8.
ARMv8 A64 has vmlal_high_s8, which lets you stay in 128-bit vectors and gives an added speed-up.
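For illustration, here is one way the widening could look on ARMv7. This is only a sketch built around the 4096-element arrays from the question (the function name and the arr1/arr2 names are mine): the s8 x s8 products are widened to s16 with vmull_s8 and folded into 32-bit accumulators every iteration with vpadalq_s16, so nothing can overflow at this array size.

#include <arm_neon.h>
#include <stdint.h>

int32_t dot_s8(const int8_t *arr1, const int8_t *arr2)
{
    int32x4_t acc32 = vdupq_n_s32(0);

    for (int i = 0; i < 4096; i += 8) {
        int8x8_t  a    = vld1_s8(&arr1[i]);
        int8x8_t  b    = vld1_s8(&arr2[i]);
        int16x8_t prod = vmull_s8(a, b);      /* widening multiply: s8*s8 -> s16 */
        acc32 = vpadalq_s16(acc32, prod);     /* pairwise add the s16 products into s32 lanes */
    }

    /* horizontal reduction of the four s32 lanes (ARMv7-friendly, no vaddvq) */
    int32x2_t half = vadd_s32(vget_low_s32(acc32), vget_high_s32(acc32));
    half = vpadd_s32(half, half);
    return vget_lane_s32(half, 0);
}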
As mentioned in the comments, seeing what auto-vectorization does with the -O options / #pragma unroll is very valuable, as is learning from the Godbolt (Compiler Explorer) output of that. Unrolling by hand often gives speed-ups as well.
There are lots more valuable optimization tips in the Arm Neon resources.
import numpy as np
array = np.random.rand(16384)
array *= 3
The Python code above multiplies each element of the array by 3.
On my laptop, this code took 5 ms.
Below is what I tried in C:
#include <headers...>
array = make 16384 elements...;
for(int i = 0; i < 16384; ++i)
    array[i] *= 3;
The compile command was:
gcc -O2 main.cpp
It takes almost 30 ms.
Is there any way I can reduce the processing time?
P.S. It was my fault; I confused the units of the timestamp values.
The C code is actually faster than NumPy. Sorry for this question.
This sounds pretty unbelievable. For reference, I wrote a trivial (but complete) program that does roughly what you seem to be describing. I used C++ so I could use its chrono library to get (much) more precise timing than C's clock provides, but I wouldn't expect that to affect the speed at all.
#include <iostream>
#include <chrono>

#define SIZE (16384)

float array[SIZE];

int main() {
    using namespace std::chrono;

    for (int i = 0; i < SIZE; i++) {
        array[i] = i;
    }

    auto start = high_resolution_clock::now();
    for (int i = 0; i < SIZE; i++) {
        array[i] *= 3.0;
    }
    auto stop = high_resolution_clock::now();

    std::cout << duration_cast<microseconds>(stop - start).count() << '\n';

    // Consume the array so the timed loop can't be optimized away.
    long total = 0;
    for (int i = 0; i < SIZE; i++) {
        total += array[i];
    }
    std::cout << "Ignore: " << total << "\n";
}
On my machine (2.8 GHz Haswell, so probably slower than whatever you're running) this shows a time of 7 or 8 microseconds, so around 600-700 times as fast as you're getting from Python.
Adding the compiler flag to use AVX 2 instructions reduces that to 4 microseconds, or a little more than 1000 times as fast (warning: AMD processors generally don't get as much of a speed boost from using AVX 2, but if you have a reasonably new AMD processor I'd expect it to be faster than this anyway).
Bottom line: the speed you're reporting for your C code only seems to make sense if you're running it on some sort of slow microcontroller, or maybe a really old desktop system, though it would have to be quite old to run nearly as slowly as you're reporting. My immediate guess is that even a 386 would be faster than that.
When/if you have something that takes enough time to justify it, you can also use OpenMP to run a loop like this in multiple threads. I tried that, but in this case the overhead of starting up and synchronizing the threads is (quite a bit) more than running in parallel can gain, so it's a net loss.
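For reference, the OpenMP attempt amounts to little more than a parallel for over the same loop, roughly like the sketch below (the function name is mine; compile with /openmp on MSVC or -fopenmp on gcc/clang). With only 16384 elements, the thread start-up and synchronization cost exceeds the work being split up, hence the net loss.

void scale_by_3(float *a, int n)
{
    int i;
    // Split the iterations across threads; each element is touched exactly once.
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] *= 3.0f;
    }
}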
Compiler: VS 2019 (Microsoft (R) C/C++ Optimizing Compiler Version 19.27.28919.3 for x64).
Flags: /O2b2 /GL (and part of the time, /arch:AVX2)
I'm studying for an exam and am trying to follow this problem:
I have the following C code to do some array initialisation:
int i, n = 61440;
double x[n];

for(i=0; i < n; i++) {
    x[i] = 1;
}
But the following runs faster (0.5s difference in 1000 iterations):
int i, n = 61440;
double x[n];

for(i=n-1; i >= 0; i--) {
    x[i] = 1;
}
I first thought that it was due to the loop accessing the n variable, and thus having to do more reads (as suggested here, for example: Why is iterating through an array backwards faster than forwards). But even if I change the n in the first loop to a hard-coded value, or conversely move the 0 in the bottom loop into a variable, the performance remains the same. I also tried changing the loops to do only half the work (go from 0 to < 30720, or from n-1 to >= 30720) to eliminate any special treatment of the value 0, but the bottom loop is still faster.
I assume it is because of some compiler optimisation? But everything I can find about the generated machine code suggests that < and >= ought to be equally fast.
Thankful for any hints or advice! Thank you!
Edit: Makefile, for compiler details (this is part of a multi threading exercise, hence the OpenMP, though for this case it's all running on 1 core, without any OpenMP instructions in the code)
#CC = gcc
CC = /opt/rh/devtoolset-2/root/usr/bin/gcc
OMP_FLAG = -fopenmp
CFLAGS = -std=c99 -O2 -c ${OMP_FLAG}
LFLAGS = -lm

.SUFFIXES : .o .c

.c.o:
	${CC} ${CFLAGS} -o $@ $*.c

sblas: sblas.o
	${CC} ${OMP_FLAG} -o $@ $@.o ${LFLAGS}
Edit2: I redid the experiment with n * 100, getting the same results:
Forward: ~170s
Backward: ~120s
Similar to the previous values of 1.7s and 1.2s, just times 100
Edit3: Minimal example - the changes described above were all localized to the vector update method. This is the default forward version, which takes longer than the backwards version with for(i = limit - 1; i >= 0; i--):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

void vector_update(double a[], double b[], double x[], int limit);

/* SBLAS code */
void *main() {
    int n = 1024*60;
    int nsteps = 1000;
    int k;
    double a[n], b[n], x[n];
    double vec_update_start;
    double vec_update_time = 0;

    for(k = 0; k < nsteps; k++) {
        // Loop over whole program to get reasonable execution time
        // (simulates a time-stepping code)
        vec_update_start = omp_get_wtime();
        vector_update(a, b, x, n);
        vec_update_time = vec_update_time + (omp_get_wtime() - vec_update_start);
    }

    printf( "vector update time = %f seconds \n \n", vec_update_time);
}

void vector_update(double a[], double b[], double x[], int limit) {
    int i;
    for (i = 0; i < limit; i++ ) {
        x[i] = 0.0;
        a[i] = 3.142;
        b[i] = 3.142;
    }
}
Edit4: the CPU is AMD quad-core Opteron 8378. The machine uses 4 of those, but I'm using only one on the main processor (core ID 0 in the AMD architecture)
It's not the backward iteration itself but the comparison with zero that makes the loop in the second case run faster.
for(i=n-1; i >= 0; i--) {
A comparison with zero comes essentially for free: the decrement already sets the flags the loop branch needs, whereas comparing against any other number needs an extra compare instruction on every iteration.
The main reason is that your compiler isn't very good at optimising. In theory there's no reason that a better compiler couldn't have converted both versions of your code into the exact same machine code instead of letting one be slower.
Everything beyond that depends on what the resulting machine code is and what it's running on. This can include differences in RAM and/or CPU speeds, differences in cache behaviour, differences in hardware prefetching (and number of prefetchers), differences in instruction costs and instruction pipelining, differences in speculation, etc. Note that (in theory) this doesn't exclude the possibility that, on most computers but not on yours, the machine code your compiler generates for the forward loop is faster than the machine code it generates for the backward loop (your sample size isn't large enough to be statistically significant, unless you're working on embedded systems or game consoles where all the machines that run the code are identical).
I have the following few lines of code that I am trying to run in parallel
void optimized(int data_len, unsigned int * input_array, unsigned int * output_array, unsigned int * filter_list, int filter_len) {
    #pragma omp parallel for
    for (int j = 0; j < filter_len; j++) {
        for (int i = 0; i < data_len; i++) {
            if (input_array[i] == filter_list[j]) {
                output_array[i] = filter_list[j];
            }
        }
    }
}
Just adding the pragma has really done wonders, but I am trying to further reduce the run time of this code. I have tried many things, ranging from array padding to collapsing the loops to creating tasks, but the only thing that has seemed to work so far is loop unrolling. Does anyone have any suggestions on what I could possibly do to further speed up this code?
You are doing pure memory accesses, so you are limited by the memory bandwidth of the machine.
Multi-threading is not going to help you much. gcc -O2 already gives you SSE instruction optimization, so using Intel intrinsics directly may not help either. You may try to check 4 ints at once, because SSE supports 128-bit registers (please see https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/X86-Built_002din-Functions.html and google for some examples). Reducing the amount of data also helps, e.g. by using short instead of int if you can.
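As a rough illustration of the "4 ints at once" idea, a single filter value can be compared against four elements per iteration with SSE2 intrinsics along these lines (a sketch only: the function name is mine, and it assumes data_len is a multiple of 4):

#include <emmintrin.h>  /* SSE2 */

void filter_one_value_sse2(int data_len, const unsigned int *input_array,
                           unsigned int *output_array, unsigned int value)
{
    __m128i v = _mm_set1_epi32((int)value);

    for (int i = 0; i < data_len; i += 4) {
        __m128i in   = _mm_loadu_si128((const __m128i *)&input_array[i]);
        __m128i out  = _mm_loadu_si128((const __m128i *)&output_array[i]);
        __m128i mask = _mm_cmpeq_epi32(in, v);   /* all-ones lanes where the input matches */
        /* keep the old output where there is no match, write the value where there is */
        __m128i merged = _mm_or_si128(_mm_and_si128(mask, v),
                                      _mm_andnot_si128(mask, out));
        _mm_storeu_si128((__m128i *)&output_array[i], merged);
    }
}

Note that this still streams the whole input once per filter value, so it stays memory-bound; it only reduces the per-element instruction count.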
This function below checks to see if an integer is prime or not.
I'm running a for loop from 3 to 2147483647 (the upper limit of a 32-bit long int).
But this code seems to hang; I can't figure out why.
#include <time.h>
#include <stdio.h>

int isPrime1(long t)
{
    long i;
    if(t==1) return 0;
    if(t%2==0) return 0;
    for(i=3; i<t/2; i+=2)
    {
        if(t%i==0) return 0;
    }
    return 1;
}

int main()
{
    long i=0;
    time_t s,e;
    s = time(NULL);
    for(i=3; i<2147483647; i++)
    {
        isPrime1(i);
    }
    e = time(NULL);
    printf("\n\t Time : %ld secs", e - s );
    return 0;
}
It will eventually terminate, but it will take a while. If you look at your loops with isPrime1 inlined, you have something like:
for(i=3; i<2147483647; i++)
    for(j=3; j<i/2; j+=2)
which is roughly n*n/4 = O(n^2) iterations. Your loop trip count is way too high.
It depends upon the system and the compiler. On Linux, with GCC 4.7.2 and compiling with gcc -O2 vishaid.c -o vishaid, the program returns immediately, because the compiler optimizes away all the calls to isPrime1 by removing them (I checked the generated assembler code with gcc -O2 -S -fverbose-asm; main does not even call isPrime1). And GCC is right: since isPrime1 has no side effects and its result is not used, its calls can be removed. Then the for loop has an empty body, so it can also be optimized away.
The lesson to learn is that when benchmarking optimized binaries, you had better have some real side effect in your code.
Also, arithmetic tells us that an integer t > 1 is prime if it has no divisor (other than 1) up to and including its square root. So here is better code:
/* needs #include <math.h> (and -lm when linking) for sqrt */
int isPrime1(long t) {
    long i;
    double r = sqrt((double)t);
    long m = (long)r;
    if(t==1) return 0;
    if(t%2==0) return 0;
    for(i=3; i<=m; i+=2)
        if(t%i==0) return 0;
    return 1;
}
On my system (x86-64/Debian/Sid with an Intel i7 3770K processor; the core running that program is at 3.5GHz) longs are 64 bits. So I coded
int main ()
{
    long i = 0;
    long cnt = 0;
    time_t s, e;
    s = time (NULL);
    for (i = 3; i < 2147483647; i++)
    {
        if (isPrime1 (i) && (++cnt % 4096) == 0) {
            printf ("#%ld: %ld\n", cnt, i);
            fflush (NULL);
        }
    }
    e = time (NULL);
    printf ("\n\t Time : %ld secs\n", e - s);
    return 0;
}
and after about 4 minutes it was still printing a lot of lines, including
#6819840: 119566439
#6823936: 119642749
#6828032: 119719177
#6832128: 119795597
I'm guessing it would need several hours to complete. After 30 minutes it is still spitting (slowly)
#25698304: 486778811
#25702400: 486862511
#25706496: 486944147
#25710592: 487026971
Actually, the program needed 4 hours and 16 minutes to complete. Last outputs are
#105086976: 2147139749
#105091072: 2147227463
#105095168: 2147315671
#105099264: 2147402489
Time : 15387 secs
BTW, this program is still really inefficient: the primes program /usr/games/primes from the bsdgames package answers much more quickly
% time /usr/games/primes 1 2147483647 | tail
2147483423
2147483477
2147483489
2147483497
2147483543
2147483549
2147483563
2147483579
2147483587
2147483629
/usr/games/primes 1 2147483647
10.96s user 0.26s system 99% cpu 11.257 total
and it has still printed 105097564 lines (most being skipped by tail)
If you are interested in prime number generation, read several math books (efficiency here is still a research subject; you could get a PhD on it). Start with the sieve of Eratosthenes and primality test pages on Wikipedia.
Most importantly, compile your program first with debugging information and all warnings (e.g. gcc -Wall -g on Linux) and learn to use your debugger (e.g. gdb on Linux). You could then interrupt your debugged program (with Ctrl-C under gdb, then let it continue with the cont command) after a minute or two and observe that the i counter in main is increasing slowly. Perhaps also ask for profiling information (with the -pg option to gcc, then use gprof). And when coding complex arithmetic things it is well worth reading good math books about them (primality testing is a very deep subject, central to most cryptographic algorithms).
This is a very inefficient approach to test for primes, and that's why it seems to hang.
Search the web for more efficient algorithms, such as the Sieve of Eratosthenes
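For reference, a minimal sketch of a sieve of Eratosthenes in C (shown for a small bound; sieving all the way up to 2147483647 takes a couple of gigabytes of byte flags, or a bit array / segmented sieve):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long limit = 1000000;                 /* small bound for illustration */
    char *composite = calloc(limit + 1, 1);     /* composite[k] != 0 means k is not prime */
    if (!composite) return 1;

    for (long p = 2; p * p <= limit; p++)
        if (!composite[p])
            for (long m = p * p; m <= limit; m += p)
                composite[m] = 1;               /* cross out every multiple of p */

    long count = 0;
    for (long p = 2; p <= limit; p++)
        if (!composite[p]) count++;
    printf("%ld primes up to %ld\n", count, limit);

    free(composite);
    return 0;
}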
Here, try this and see whether it's really an infinite loop:
int main()
{
    long i=0;
    time_t s,e;
    s = time(NULL);
    for(i=3; i<2147483647; i++)
    {
        isPrime1(i);
        //print the elapsed time after each iteration
        e = time(NULL);
        printf("\n\t Time for loop %ld: %ld secs", i, e - s );
    }
    return 0;
}