I have the following code snippet,
int main()
{
int loop;
char * src = 0x20000000;
char * dest = 0x20000008;
for(loop = 0; loop < 8; loop++)
dest [loop] = src [loop];
}
Is this valid code? How to optimize the logic to reduce looping?
Assuming the compiler doesn't do it automatically, it can be optimized if a couple of assumptions hold:
Size of char is 1 byte
The CPU instruction set supports 8-byte operations (for non-64-bit platforms the implementation may vary)
Cast (or directly define) the source and destination addresses to unsigned long long*
Perform a direct assignment from source to destination. If the platform's instruction set supports 64-bit operations, this should result in one 64-bit chunk being copied from source to destination. E.g., on 64-bit Intel CPUs it can be done with a single movsq assembly instruction.
int main()
{
unsigned long long * src = (unsigned long long *)0x20000000;
unsigned long long * dest = (unsigned long long *)0x20000008;
*dest = *src;
return 0;
}
Is this valid code?
You can use your compiler to find that out. The loop part in particular is valid code, which I suppose is what you're really asking about.
How to optimize the logic to reduce looping?
Turn on optimization in your compiler; it will take care of the rest. There's no way to improve on that code from a performance perspective, though you could use memcpy() to make the code more concise and easier to read.
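As a minimal sketch of the memcpy() suggestion, the byte loop could be written as below. The names copy_block and BLOCK_SIZE are made up for illustration and are not from the question.

```c
#include <string.h>

enum { BLOCK_SIZE = 8 };  /* hypothetical name: number of bytes the question copies */

/* Copies BLOCK_SIZE bytes from src to dest, like the byte loop in the question.
   The two regions must not overlap (use memmove if they might). */
void copy_block(char *dest, const char *src)
{
    memcpy(dest, src, BLOCK_SIZE);
}
```

On the target this might be called as copy_block((char *)0x20000008, (const char *)0x20000000), assuming both addresses are valid memory on that platform. An optimizing compiler will typically expand a small fixed-size memcpy() inline rather than emit a call.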
I am currently programming an 8051 µC in C (compiler: Wickehaeuser µC/51), and I am wondering which is the best way to populate a structure. In my current case I have a time/date structure which should be populated with the current time/date from an RTC via SFRs.
So I am thinking of the best method to do this:
Get the data via return value by creating the variable inside the function (get_foo_create)
Get data via call by reference (get_foo_by_reference)
Get the data via call by reference and also return the pointer (while writing this I think it is silly, but I am considering it anyway) (get_foo_pointer_return)
The following code is just an example (note: there is currently a failure in the last print, which does not print the expected value at the moment)
Which is the best method?
typedef struct {
unsigned char foo;
unsigned char bar;
unsigned char baz;
}data_foo;
data_foo get_foo_create(void) {
data_foo foo;
foo.bar = 2;
return foo;
}
void get_foo_by_reference(data_foo *foo) {
// Read values e.g. from SFR
foo->bar = 42; // Just simulate SFR
}
data_foo *get_foo_pointer_return(data_foo *foo) {
// Read values e.g. from SFR
(*foo).bar = 11; // Just simulate SFR
return foo;
}
/**
* Main program
*/
void main(void) {
data_foo struct_foo;
data_foo *ptr_foo;
seri_init(); // Serial Com init
clear_screen();
struct_foo = get_foo_create();
printf("%d\n", struct_foo.bar);
get_foo_by_reference(&struct_foo);
printf("%d\n", struct_foo.bar);
ptr_foo = get_foo_pointer_return(&ptr_foo);
//Temp problem also here, got 39 instead 11, tried also
//printf("%d\n",(void*)(*ptr_foo).bar);
printf("%d\n",(*ptr_foo).bar);
SYSTEM_HALT; // Program end
}
On the 8051, you should avoid using pointers to the extent possible. Instead, it's generally best--if you can afford it--to have some global structures which will be operated upon by various functions. Having functions for "load thing from address" and "store thing to address", along with various functions that manipulate thing, can be much more efficient than trying to have functions that can operate on objects of that type "in place".
For your particular situation, I'd suggest having a global structure called "time", as well as a global union called "ldiv_acc" which combines a uint32_t, two uint16_t, and four uint8_t. I'd also suggest having an "ldivmod" function which divides the 32-bit value in ldiv_acc by an 8-bit argument, leaving the quotient in ldiv_acc and returning the remainder, as well as an "lmul" function which multiplies the 32-bit value in ldiv_acc by an 8-bit value. It's been a long time since I've programmed the 8051, so I'm not sure what help compilers need to generate good code, but 32x32 divisions and multiplies are going to be expensive compared with using a combination of 8x8 multiplies and divides.
On the 8051, code like:
uint32_t time;
uint32_t sec,min,hr;
sec = time % 60;
time /= 60;
min = time % 60;
time /= 60;
hr = time % 24;
time /= 24;
is likely to be big and slow. Using something like:
ldiv_acc.l = time;
sec = ldivmod(60);
min = ldivmod(60);
hr = ldivmod(24);
is apt to be much more compact and, if you're clever, faster. If speed is really important, you could use functions to perform divmod6, divmod10, divmod24, and divmod60, taking advantage of the fact that divmod60(l+256*h) is equal to h*4+divmod60(h*16+l), because 256 = 4*60+16. The second addition might yield a value greater than 255, but if it does, applying the same technique again (repeatedly, if needed) will get the operand below 256. Dividing an unsigned char by another unsigned char is faster than divisions involving unsigned int.
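The divmod60 identity above can be sketched for a 16-bit operand as follows. The function name divmod60_u16 and its in-place interface are assumptions made for illustration, not part of any 8051 library.

```c
#include <stdint.h>

/* Hypothetical helper: replaces *value with *value / 60 and returns *value % 60.
   Only shifts, small multiplies, and one 8-bit divide are used. */
uint8_t divmod60_u16(uint16_t *value)
{
    uint16_t quotient = 0;
    uint16_t v = *value;

    /* 256 = 4*60 + 16, so (256*h + l) / 60 == 4*h + (16*h + l) / 60,
       and both sides leave the same remainder. */
    while (v > 255) {
        uint8_t h = (uint8_t)(v >> 8);
        uint8_t l = (uint8_t)(v & 0xFF);
        quotient += (uint16_t)(4u * h);
        v = (uint16_t)(16u * h + l);   /* strictly smaller than before */
    }

    /* v now fits in 8 bits, so an 8-bit divide finishes the job. */
    quotient += (uint8_t)v / 60;
    *value = quotient;
    return (uint8_t)v % 60;
}
```

For example, starting from 3661 the function leaves 61 in the variable and returns 1. Whether this beats the compiler's own 16-bit division on a given 8051 toolchain would need to be measured.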
My project is to scan an address space (which in my case is 0x00000000 - 0xffffffff, or 0 to 2^32 - 1) for a pattern and return in an array the locations in memory where the pattern was found (could be found multiple times).
Since the address space is 32 bits, i is a double and max is pow(2,32) (also a double).
I want to keep the original value of i intact so that I can use that to report the location of where the pattern was found (since actually finding the pattern requires moving forward several bytes past i), so I want temp, declared as char *, to copy the value of i. Then, later in my program, I will dereference temp.
double i, max = pow(2, 32);
char *temp;
for (i = 0; i < max; i++)
{
temp = (char *) i;
//some code involving *temp
}
The issue I'm running into is that a double can't be cast to a char *. An int can be; however, since the address space is 32 bits (not 16) I need a double, which is exactly large enough to represent 2^32.
Is there anything I can do about this?
In C, double and float are not represented the way you think they are; this code demonstrates that:
#include <stdio.h>
typedef union DI
{
double d;
int i;
} DI;
int main()
{
DI di;
di.d = 3.00;
printf("%d\n", di.i);
return 0;
}
You will not see an output of 3 in this case.
In general, even if you could read another process's memory, your strategy is not going to work on any modern operating system because of virtual memory: the address space that one process "sees" usually does not correspond to the physical memory on the system.
Never use a floating point variable to store an integer. Floating point variables make approximate computations. It would happen to work in this case, because the integers are small enough, but to know that, you need intimate knowledge of how floating point works on a particular machine/compiler and what range of integers you'll be using. Plus it's harder to write the program, and the program would be slower.
C defines an integer type that's large enough to store a pointer: uintptr_t. You can cast a pointer to uintptr_t and back. On a 32-bit machine, uintptr_t will be a 32-bit type, so it's only able to store values up to 2^32-1. To express a loop that covers the whole range of the type including the first and last value, you can't use an ordinary for loop with a variable that's incremented, because the ending condition requires a value of the loop index that's out of range. If you naively write
uintptr_t i;
for (i = 0; i <= UINTPTR_MAX; i++) {
unsigned char *temp = (unsigned char *)i;
// ...
}
then you get an infinite loop, because after the iteration with i equal to UINTPTR_MAX, running i++ wraps the value of i to 0. The fact that the loop is infinite can also be seen in a simpler logical way: the condition i <= UINTPTR_MAX is always true since all values of the type are less than or equal to the maximum.
You can fix this by putting the test near the end of the loop, before incrementing the variable.
i = 0;
do {
unsigned char *temp = (unsigned char *)i;
// ...
if (i == UINTPTR_MAX) break;
i++;
} while (1);
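The same loop shape can be checked on a type small enough to run quickly. The sketch below uses uint8_t in place of uintptr_t purely for illustration; count_all_uint8 is a made-up name.

```c
#include <stdint.h>

/* Visits every uint8_t value once, using the "test before increment"
   loop shape, and returns how many iterations ran. */
unsigned count_all_uint8(void)
{
    unsigned visited = 0;
    uint8_t i = 0;

    do {
        visited++;                   /* stands in for the real loop body */
        if (i == UINT8_MAX) break;   /* last value handled: stop before wrapping */
        i++;
    } while (1);

    return visited;
}
```

The loop runs once for each of the 256 values, including both 0 and UINT8_MAX, and then terminates.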
Note that exploring 4GB in this way will be extremely slow, if you can even do it. You'll get a segmentation fault whenever you try to access an address that isn't mapped. You can handle the segfault with a signal handler, but that's tricky and slow. What you're attempting may or may not be what your teacher expects, but it doesn't make any practical sense.
To explore a process's memory on Linux, read /proc/self/maps to discover its memory mappings. See my answer on Unix.SE for some sample code in Python.
Note also that if you're looking for a pattern, you need to take the length of the whole pattern into account; a byte-by-byte lookup doesn't do the whole job.
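As a rough sketch of what taking the pattern length into account might look like, the function below searches an ordinary in-memory buffer rather than the raw address space. The name find_pattern and its parameters are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Stores up to max_hits offsets at which pat occurs in buf and returns the
   total number of occurrences found (which may exceed max_hits). */
size_t find_pattern(const unsigned char *buf, size_t buf_len,
                    const unsigned char *pat, size_t pat_len,
                    size_t *hits, size_t max_hits)
{
    size_t count = 0;

    if (pat_len == 0 || pat_len > buf_len)
        return 0;

    /* Stop while a full pattern still fits, so memcmp never reads past buf. */
    for (size_t i = 0; i + pat_len <= buf_len; i++) {
        if (memcmp(buf + i, pat, pat_len) == 0) {
            if (count < max_hits)
                hits[count] = i;
            count++;
        }
    }
    return count;
}
```

Searching the 8 bytes "abcabcab" for "abc" reports offsets 0 and 3; the trailing "ab" is not compared past the end of the buffer.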
Ahh, a school assignment. OK then.
uint32_t i;
for ( i = 0; i < 0xFFFFFFFF; i++ )
{
char *x = (char *)i;
// Do magic here.
}
// Also, the above code skips on 0xFFFFFFFF itself, so magic that one address here.
// But if your pattern is longer than 1 byte, then it's not necessary
// (in fact, use something less than 0xFFFFFFFF in the above loop then)
The cast of a double to a pointer is a constraint violation - hence the error.
A floating type shall not be converted to any pointer type. C11dr §6.5.4 4
To scan the entire 32-bit address space, use a do loop with an integer type capable of the [0 ... 0xFFFFFFFF] range.
uint32_t address = 0;
do {
char *p = (char *) address;
foo(p);
} while (address++ < 0xFFFFFFFF);
For very simple iterators or iterative loops over a range of memory the following two methods can be used (code in simple C to resemble the underlying machine instructions):
Counter:
int a[10]; int *p=a; int cnt = 10;
do{
foo(*p++); /* loop action */
cnt--;
} while( cnt > 0);
Pointer:
int a[10]; int *p=a; int *stop=a+10;
do{
foo(*p++); /* loop action */
} while( p < stop);
Although the latter version seems to be one instruction shorter, most machines that I can recall from my dated memory have a "decrement register and jump if not zero" instruction, which is about as fast as (or even faster than) the pointer compare, even more so on architectures with 64-bit pointers and 32-bit data objects. Which of the two versions will be faster on AMD64 and on ARM/ARM64? Is there a compare-pointer-and-branch-if-less instruction?
Profile your application and see if different loops make a difference.
In modern CPUs the bottleneck is memory access latency, rather than instruction throughput. What often makes a real difference is optimizing memory access to avoid CPU cache misses.
I am working in a space limited environment. I collect an array of unsigned 32 bit ints via DMA, but I need to work on them as single precision floats using DSP extensions in the MCU. Copying the array is not possible - it takes up almost all existing SRAM. Is there a neat way to do this?
[Note] The data values are only 12 bits so out of range problems will not exist
You can just do it like this:
uint32_t a[N];
float *f = (float *)a;
for (size_t i = 0; i < N; ++i)
{
f[i] = (float)a[i];
}
Note that this breaks strict aliasing rules so you should compile with -fno-strict-aliasing or equivalent.
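If changing compiler flags is not an option, one possible alternative is to move each element through memcpy(), which sidesteps the aliasing question. This is only a sketch: the function name u32_to_float_in_place is made up, and it assumes float and uint32_t have the same size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "in-place conversion assumes a 32-bit float");

/* Converts count 32-bit unsigned samples stored in buf to floats, in place. */
void u32_to_float_in_place(void *buf, size_t count)
{
    unsigned char *bytes = buf;

    for (size_t i = 0; i < count; i++) {
        uint32_t raw;
        float converted;

        memcpy(&raw, bytes + i * sizeof raw, sizeof raw);   /* read the integer */
        converted = (float)raw;                             /* numeric conversion */
        memcpy(bytes + i * sizeof converted, &converted, sizeof converted);
    }
}
```

After the call, the buffer should only be read as float. Compilers usually reduce these fixed-size memcpy() calls to plain loads and stores, but that is worth confirming on the actual MCU toolchain.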
I am trying to exploit the SIMD 512 offered by knc (Xeon Phi) to improve performance of the below C code using intel intrinsics. However, my intrinsic embedded code runs slower than auto-vectorized code
C Code
int64_t match=0;
int *myArray __attribute__((align(64)));
myArray = (int*) malloc (sizeof(int)*SIZE); //SIZE is array size taken from user
radomize(myArray); //to fill some random data
int searchVal=24;
#pragma vector always
for(int i=0;i<SIZE;i++) {
if (myArray[i]==searchVal) match++;
}
return match;
Intrinsic embedded code:
In the code below I first load the array and compare it with the search key. The compare intrinsic returns a 16-bit mask, which is then reduced using _mm512_mask_reduce_add_epi32().
register int64_t match=0;
int *myArray __attribute__((align(64)));
myArray = (int*) malloc (sizeof(int)*SIZE); //SIZE is array size taken from user
const int values[16]=\
{ 1,1,1,1,\
1,1,1,1,\
1,1,1,1,\
1,1,1,1,\
};
__m512i const flag = _mm512_load_epi32((void*) values);
__mmask16 countMask;
__m512i searchVal = _mm512_set1_epi32(16);
__m512i kV = _mm512_setzero_epi32();
for (int i=0;i<SIZE;i+=16)
{
// kV = _mm512_setzero_epi32();
kV = _mm512_loadunpacklo_epi32(kV,(void* )(&myArray[i]));
kV = _mm512_loadunpackhi_epi32(kV,(void* )(&myArray[i + 16]));
countMask = _mm512_cmpeq_epi32_mask(kV, searchVal);
match += _mm512_mask_reduce_add_epi32(countMask,flag);
}
return match;
I believe I have somehow introduced extra cycles in this code, and hence it runs slowly compared to the auto-vectorized code. Unlike SIMD128, which returns the result of the compare directly in a 128-bit register, SIMD512 returns the result in a mask register, which adds complexity to my code. Am I missing something here? There must be a way to compare directly and keep a count of successful matches without using masks, for example with XOR ops.
Finally, please suggest ways to increase the performance of this code using intrinsics. I believe I can squeeze out more performance using intrinsics. This was at least true for SIMD128, where using intrinsics gained me 25% performance.
I suggest the following optimizations:
Use prefetching. Your code performs very little computation and is almost surely bandwidth-bound. Xeon Phi has hardware prefetching only for the L2 cache, so for optimal performance you need to insert prefetching instructions manually.
Use the aligned read _mm512_load_epi32 as hinted by @PaulR. Use the memalign function instead of malloc to guarantee that the array is really aligned on 64 bytes. And in case you ever need misaligned loads, use _mm512_undefined_epi32() as the source for the first misaligned load, as it breaks the dependency on kV (in your current code) and lets the compiler do additional optimizations.
Unroll the loop by 2 or use at least two threads to hide instruction latency.
Avoid using an int variable as the index. unsigned int, size_t or ssize_t are better options.
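The last two suggestions (unrolling by 2 and using an unsigned index) can be illustrated in plain scalar C. This sketch does not use the KNC intrinsics; count_matches is a made-up name, and the same structure would have to be carried over to the 512-bit loop by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts elements equal to key, handling two elements per iteration with
   independent accumulators so the two additions do not depend on each other. */
int64_t count_matches(const int *data, size_t n, int key)
{
    int64_t even_hits = 0;
    int64_t odd_hits = 0;
    size_t i = 0;                 /* size_t index instead of int */

    for (; i + 1 < n; i += 2) {
        even_hits += (data[i] == key);
        odd_hits += (data[i + 1] == key);
    }
    if (i < n)                    /* leftover element when n is odd */
        even_hits += (data[i] == key);

    return even_hits + odd_hits;
}
```

In the intrinsic version the analogous change would be two loads and two compares per iteration feeding separate counters, with the counters summed after the loop. Whether that helps on a given Xeon Phi should be confirmed by measurement.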