Efficient way to scan struct nested arrays - C

I have a struct that has several arrays members:
typedef struct someStruct {
    uint16_t array1[ARRAY_LENGTH];
    uint16_t array2[ARRAY_LENGTH];
} myData;
myData testData = {0}; // Global struct
At some point in my program I need to set the arrays to some set of predefined values, e.g., set array1 to all 0, array2 to all 0xFF, etc. My first instinct was to write out a for loop something like:
void someFunction(myData *test) {
    for (uint16_t i = 0; i < ARRAY_LENGTH; ++i) {
        test->array1[i] = 0xFF;
        test->array2[i] = 0xCC;
    }
}
However I then reasoned that the actions required by the program to do this would go something like:
load address of array1 first position
set value 0xFF;
load far address of array2 first position
set value 0xCC;
load far address of array1 second position
set value 0xFF;
// and so on...
Whereas if I used a separate loop for each array, the addresses would be much nearer each other (as arrays and structs are stored contiguously), so each address load is only to the next element, making the code more efficient, as follows:
void someFunction(myData *test) {
    uint16_t i;
    for (i = 0; i < ARRAY_LENGTH; ++i)
        test->array1[i] = 0xFF;
    for (i = 0; i < ARRAY_LENGTH; ++i)
        test->array2[i] = 0xCC;
}
Is my reasoning correct, is the second one better? Furthermore, would a compiler (say gcc, for e.g.) normally be able to make this optimization itself?

It's going to depend on your system architecture. On, say, a SPARC system, the cache line size is 64 bytes and there are enough cache slots for both arrays, so the first version would be efficient. The load of the first array element would populate a cache line, and subsequent loads would be very fast. If the compiler is smart enough, it can use prefetching as well.
On ISAs that support offset addressing, the code doesn't actually fetch the address of the array element each time; it just increments an offset. It fetches the base address of the array once, then uses a load/store instruction with the base and offset, incrementing the offset in a register on each pass through the loop. Some instruction sets even have auto-increment addressing modes.
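In C terms, the effect is roughly the following pointer-walking form (a sketch of the idea only, not actual compiler output):
void someFunction(myData *test)
{
    uint16_t *p1 = test->array1;  /* base addresses fetched once */
    uint16_t *p2 = test->array2;
    uint16_t i;
    for (i = 0; i < ARRAY_LENGTH; ++i) {
        *p1++ = 0xFF;             /* only the pointer/offset advances */
        *p2++ = 0xCC;
    }
}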
The best thing to do would be to write a sample program/function and try it. Optimizations at this low a level require either a thorough knowledge of the CPU/system, or lots of trial and error.

My humble recommendation: try it and see. The one-loop solution saves the arithmetic around incrementing and testing i. Two loops will probably profit from better cache behavior, especially if the arrays are aligned to memory pages; in that case each alternating access may cause a cache miss and reload. Personally, if the speed really matters, I would prefer two loops with some unrolling, as sketched below.
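For example, the unrolled two-loop version might look like this (a sketch, assuming ARRAY_LENGTH is a multiple of 4):
void someFunction(myData *test)
{
    uint16_t i;
    for (i = 0; i < ARRAY_LENGTH; i += 4) {
        test->array1[i]     = 0xFF;
        test->array1[i + 1] = 0xFF;
        test->array1[i + 2] = 0xFF;
        test->array1[i + 3] = 0xFF;
    }
    for (i = 0; i < ARRAY_LENGTH; i += 4) {
        test->array2[i]     = 0xCC;
        test->array2[i + 1] = 0xCC;
        test->array2[i + 2] = 0xCC;
        test->array2[i + 3] = 0xCC;
    }
}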

Related

Make all pointers in an array of pointers point to the same thing in C?

I have these two definitions:
uint8_t *idx[0x100];
uint8_t raw[0x1000];
Is there any other way than to loop over every element of idx to point them all to raw[0]?
for (i = 0; i < sizeof(raw); i++)
    idx[i] = &raw[0];
There must be a faster way than the loop above. Is there an equivalent of memset for pointers?
The simple, straightforward loop is probably the best way (note that there's an error in your current loop as others pointed out).
The advantage is that those kind of loops are very easy to optimize, it's such a common case that compilers have gotten very good at it, and your compiler will use vector instructions and other optimizations as needed to keep it very fast without needing to hand-optimize yourself.
And of course at the same time it is more readable, more maintainable, than optimizing it by hand.
Of course if there's a special case, for example if you want to fill it with null pointers, or if you know what the content will be at compile time, then there are some slightly more efficient ways to do that, but in the general case making it easy for your compiler to optimize your code is the simplest way to get good performance.
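For reference, here is a corrected version of the loop from the question (the original iterates sizeof(raw) = 0x1000 times over a 0x100-element array):
uint8_t *idx[0x100];
uint8_t raw[0x1000];

for (size_t i = 0; i < sizeof(idx) / sizeof(idx[0]); i++)
    idx[i] = &raw[0];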
From a performance engineering perspective, there is indeed a way to make it faster than
for (i = 0; i < sizeof(raw); i++)
    idx[i] = &raw[0];
if you compare with the compiler's optimizer turned off, but the difference could be very minor.
Let's do it:
uint8_t *idx[0x100];
uint8_t raw[0x1000];

#define lengthof(arr) (sizeof(arr) / sizeof(*arr))

uint8_t **start = idx;                /* note: pointer to pointer */
int length = lengthof(idx);
uint8_t **end = idx + (length & ~1);  /* round down to an even count */
while (start < end)
{
    *start++ = raw;
    *start++ = raw;
}
if (length & 1)
    *start = raw;
This is faster mainly for two reasons:
It operates directly on pointers: with idx[i], the assembly computes (idx + i * sizeof *idx) on each iteration, while *start already has the address in hand.
It duplicates the work in each iteration, so the code has less branching while maintaining locality. gcc -O2 will most likely do this trick for you anyway.
We only see a fragment of code; if you are initializing a global array of pointers to point to a global array of uint8_t, there is a faster way: write an explicit initializer. The initialization is done at compile time and takes virtually no time at execution time.
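For instance, with GCC's range-designator extension (not standard C; note raw must be declared before idx) the whole thing becomes a compile-time initializer:
uint8_t raw[0x1000];
uint8_t *idx[0x100] = { [0 ... 0xFF] = raw };  /* GNU C range designator */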
If the array is automatic, I'm afraid there is no faster way to do this. If your compiler is clever and instructed to use optimizations (-O2, -O3, etc.) it will probably unroll the loop and generate pretty efficient code. Look at the assembly output to verify this. If it does not, you can unroll the loop yourself:
Assuming the array size is a multiple of 4:
for (i = 0; i < sizeof(idx) / sizeof(*idx); i += 4)
    idx[i] = idx[i+1] = idx[i+2] = idx[i+3] = &raw[0];
Note that you should be careful with the sizeof operator: in addition to using the wrong array for the size computation, your code makes 2 implicit assumptions:
The array element is a char
idx is an array, not a pointer to an array.
It is advisable to use sizeof(idx) / sizeof(*idx) to compute the number of elements of the array: this expression works for all array element types, but idx still needs to be an array type. Defining a macro:
#define countof(a) (sizeof(a) / sizeof(*(a)))
Makes it more convenient, but hides the problem if a is a pointer.
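With the macro, the loop from the question becomes:
for (size_t i = 0; i < countof(idx); i++)  /* 0x100 iterations, as intended */
    idx[i] = &raw[0];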

Most efficient way to check if elements in an array have changed

I have an array in C and I need to perform some operation only if the elements in the array have changed. However, the time and memory this takes is very important. I realized that an efficient way to do this would probably be to hash all the elements of the array and compare the result with the previous result; if they match, the elements didn't change. I would however like to know if this is the most efficient way of doing things. Also, since the array is only 8 bytes long (1 byte for each element), which hashing function would be least time consuming?
The elements in the array are actually being received from another microcontroller, so they may or may not change depending on whether what the other microcontroller measured is the same or not.
If you weren't tied to a simple array, you could create a "MRU" List of structures where the structure could contain a flag that indicates if the item was changed since it was last inspected.
Every time an item changes set the "changed flag" and move it to the head of the list. When you need to check for the changed items you traverse the list from the head and unset the changed flags and stopping at the first element with its change flag not set.
Sorry, I missed the part about the array being only 8 bytes long. With that info and with the new info from your edit, I'm thinking the previous suggestion is not ideal.
If the array is only 8 bytes long, why not just cache a copy of the previous array and compare it to the new array received?
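A minimal sketch of that idea (onNewData() is a hypothetical receive hook, not from the question):
#include <stdint.h>
#include <string.h>

#define ARR_SIZE 8

static uint8_t cachedArr[ARR_SIZE];

void onNewData(const uint8_t newArr[ARR_SIZE])
{
    /* memcmp of 8 bytes typically compiles down to a single 64-bit compare */
    if (memcmp(cachedArr, newArr, ARR_SIZE) != 0)
    {
        /* array changed: do the work, then re-cache it */
        memcpy(cachedArr, newArr, ARR_SIZE);
    }
}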
Below is a clarification of my comment about "shortcutting" the compares. How you implement this would depend on what the sizeof(int) is on the platform used.
Using a 64-bit integer you could get away with one compare to determine if the array has changed. For example:
#define ARR_SIZE 8

unsigned char cachedArr[ARR_SIZE];
unsigned char targetArr[ARR_SIZE];

unsigned int *ic = (unsigned int *)cachedArr;
unsigned int *it = (unsigned int *)targetArr;

// This assertion needs to be true for this implementation to work
// correctly (i.e., unsigned int must be 8 bytes on this platform).
assert(sizeof(unsigned int) == sizeof(cachedArr));
/*
** ...
** assume initialization and other stuff here
** leading into the main loop that is receiving the target array data.
** ...
*/
if (*ic != *it)
{
    // Target array has changed; find out which element(s) changed.
    // If you only cared that there was a change and did not care
    // to know which specific element(s) had changed, you could forego
    // this loop altogether.
    for (int i = 0; i < ARR_SIZE; i++)
    {
        if (cachedArr[i] != targetArr[i])
        {
            // Do whatever needs to be done based on the i'th element
            // changing
        }
    }

    // Cache the array again since it has changed.
    memcpy(cachedArr, targetArr, sizeof(cachedArr));
}
// else no change to the array
// else no change to the array
If the native integer size was smaller than 64-bit you could use the same theory, but you'd have to loop over the array sizeof(cachedArr) / sizeof(unsigned int) times; and there would be a worst-case scenario involved (but isn't there always) if the change was in the last chunk tested.
It should be noted that when casting char data to an integer type you may need to take alignment into consideration (whether the char data is aligned to the appropriate word-size boundary).
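A sketch of an alignment-safe variant (my addition, not part of the original answer): memcpy into a 64-bit integer avoids both the alignment and the strict-aliasing concerns, and compilers typically reduce each memcpy to a single load:
#include <stdint.h>
#include <string.h>

int arraysDiffer(const unsigned char cachedArr[8], const unsigned char targetArr[8])
{
    uint64_t c, t;
    memcpy(&c, cachedArr, sizeof c);  /* safe regardless of alignment */
    memcpy(&t, targetArr, sizeof t);
    return c != t;
}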
Thinking further upon this however, it might be better altogether to just unroll the loop yourself and do:
if (cachedArr[0] != targetArr[0])
{
    doElement0ChangedWork();
}
if (cachedArr[1] != targetArr[1])
{
    doElement1ChangedWork();
}
if (cachedArr[2] != targetArr[2])
{
    doElement2ChangedWork();
}
if (cachedArr[3] != targetArr[3])
{
    doElement3ChangedWork();
}
if (cachedArr[4] != targetArr[4])
{
    doElement4ChangedWork();
}
if (cachedArr[5] != targetArr[5])
{
    doElement5ChangedWork();
}
if (cachedArr[6] != targetArr[6])
{
    doElement6ChangedWork();
}
if (cachedArr[7] != targetArr[7])
{
    doElement7ChangedWork();
}
Again, depending on whether or not knowing which specific element(s) changed that could be tightened up. This would result in more instruction memory needed but eliminates the loop overhead (the good old memory versus speed trade-off).
As with anything time/memory related test, measure, compare, tweak and repeat until desired results are achieved.
only if the elements in an array have changed
Who else but you is going to change them? You can just keep track of whether you've made a change since the last time you did the operation.
If you don't want to do that (perhaps because it'd require recording changes in too many places, or because the record-keeping would take too much time, or because another thread or other hardware is messing with the array), just save the old contents of the array in a separate array. It's only 8 bytes. When you want to see whether anything has changed, compare the current array to the copy element-by-element.
As others have said, the elements will only change if the code changed them.
Maybe this data can be changed by another user? Otherwise you would know that you had changed an entry.
As far as the hash function goes, an 8-byte array can take 2^64 different values, so a hash into fewer bits can collide: two different arrays may produce the same hash value. A hash also has to be computed, which costs time and memory, so I don't think that will work for your application.
I would just compare bytes until you find one that has changed. If something has changed, you will check 4 bytes on average before you know that your array has changed (assuming that each byte is equally likely to change).
If nothing has changed, that is the worst-case scenario and you will have to check all eight bytes to conclude that none have changed.
If the array is only 8 bytes long, you can treat it as a single long long number. Suppose the original array is char data[8]:
long long *pData = (long long *)data;
long long olddata = *pData;   // cache the previous contents
/* ... new data arrives into data[] ... */
if (olddata != *pData)
{
    // detect which one changed
}
I mean, this way you operate on all the data in one shot, which is much faster than accessing each element by index. A hash is slower in this case.
If it is byte-oriented with only eight elements, XORing each pair and ORing the results together is an efficient comparison (it must be |, not &, so that any single differing pair makes the whole expression nonzero):
if ((LocalArray[0] ^ ReceivedArray[0]) | (LocalArray[1] ^ ReceivedArray[1]) | ...)
{
    //Yes it is changed
}

Runtime of Initializing an Array Zero-Filled

If I were to define the following array using the zero-fill initialization syntax on the stack:
int arr[ 10 ] = { 0 };
... is the run time constant or linear?
My assumption is that it's linear run time, based only on the fact that calloc must go over every byte to zero-fill it.
If you could also provide the why, and not just "it's order xxx", that would be tremendous!
The runtime is linear in the array size.
To see why, here's a sample implementation of memset, which initializes an array to an arbitrary value. At the assembly-language level, this is no different than what goes on in your code.
void *memset(void *dst, int val, size_t count) {
    unsigned char *start = dst;
    for (size_t i = 0; i < count; i++)
        *start++ = val;
    return dst;
}
Of course, compilers will often use intrinsics to set multiple array elements at a time. Depending on the size of the array and things like alignment and padding, this might make the runtime over array length more like a staircase, with the step size based on the vector length. Over small differences in array size, this would effectively make the runtime constant, but the general pattern is still linear.
This is actually a tip-of-the-iceberg question. What you are really asking is: what is the order (big-O) of initializing an array? Essentially, the code loops through each element of the array and sets it to zero. You could write a for loop to do the same thing.
The order of magnitude of that loop is O(n); that is, the time spent in the loop increases in proportion to the number of elements being initialized.
If the hardware supported an instruction that set all bytes from location X to Y to zero, and that instruction took M instruction cycles where M never changed regardless of the number of bytes being set, then that would be of constant order.
Constant order is conventionally written O(1), and linear order O(n).

C cache optimization for direct mapped cache

Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024 Byte direct-mapped cache with block sizes of 16 bytes. So that makes 64 lines (sets in this case) then. Assume the cache starts empty. Consider the following code:
struct pos {
    int x;
    int y;
};

struct pos grid[16][16];
int total_x = 0;
int total_y = 0;

void function1() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[j][i].x;
            total_y += grid[j][i].y;
        }
    }
}

void function2() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[i][j].x;
            total_y += grid[i][j].y;
        }
    }
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How data are stored in memory
Every structure pos has a size of 8 bytes, thus the total size of pos[16][16] is 2048 bytes. The order of the array in memory is as follows:
pos[0][0] pos[0][1] pos[0][2] ... pos[0][15] pos[1][0] ... pos[1][15] ... pos[15][0] ... pos[15][15]
The cache organization compared to the data
For the cache, each block is 16 bytes, the same size as two elements of the array. The entire cache is 1024 bytes, which is half the size of the entire array. Since the cache is direct-mapped, if we label the cache blocks from 0 to 63, the mapping looks like this:
------------ memory ---------------------------- cache
pos[0][0]   pos[0][1]   -----------> block 0
pos[0][2]   pos[0][3]   -----------> block 1
pos[0][4]   pos[0][5]   -----------> block 2
.......
pos[0][14]  pos[0][15]  -----------> block 7
pos[1][0]   pos[1][1]   -----------> block 8
pos[1][2]   pos[1][3]   -----------> block 9
.......
pos[7][14]  pos[7][15]  -----------> block 63
pos[8][0]   pos[8][1]   -----------> block 0
.......
pos[15][14] pos[15][15] -----------> block 63
How function1 manipulates memory
The loop has a column-wise inner traversal: the first iteration loads pos[0][0] and pos[0][1] into cache block 0, the second iteration loads pos[1][0] and pos[1][1] into cache block 8. The cache is cold, so within each pair the .x access is always a miss while the .y access is always a hit. You might expect the second column's data to have been loaded into the cache during the first column's pass, but that is NOT the case: the access to pos[8][0] has already evicted the former pos[0][0] block (they both map to block 0!). And so on: each struct costs 1 miss (.x) and 1 hit (.y), so the miss rate is 50%.
How function2 manipulates memory
The second function has a nice stride-1 access pattern. When accessing pos[0][0].x, pos[0][0].y, pos[0][1].x, pos[0][1].y, only the first access is a miss, due to the cold cache; the freshly loaded 16-byte block then serves the other three. The pattern repeats for every block, giving 1 miss per 4 accesses, so the miss rate is only 25%.
A k-way associative cache follows the same analysis, although it may be more tedious. To get the most out of the cache system, try to use a nice access pattern, say stride-1, and use the data as much as possible during each load from memory. Real-world CPU microarchitectures employ other intelligent designs and algorithms to enhance efficiency. The best method is always to measure the time in the real world, dump the core code, and do a thorough analysis.
Ok, my computer science lectures are a bit far off, but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 x 4). Since your cache blocks are 16 bytes, a memory access grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index, only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that was fetched; that depends on the number of fetches in the inner loop vs. the overall cache size, though.
Now we have to think about the cache size as well: you say that you have 64 lines that are direct-mapped. In function1, an inner loop is 16 fetches. That means on the 17th fetch you get to grid[j][i+1]. This should actually be a hit, since it should have been kept in the cache since the last inner-loop walk. Every second inner loop should therefore consist only of hits.
Well, if my reasoning is correct, the answer that was given to you would be wrong. Both functions would then perform with 25% misses. Maybe someone can find a better answer, but if you follow my reasoning, I'd ask a TA about it.
Edit: Thinking about it again, we should first define what actually qualifies as a miss or a hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimization settings should optimize this to
struct pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."

Embedded C - How to create a cache for expensive external reads?

I am working with a microcontroller that has an external EEPROM containing tables of information.
There is a large amount of information, however there is a good chance that we will request the same information cycle to cycle if we are fairly 'stable' - i.e. if we are at a constant temperature for example.
Reads from the EEPROM take around 1 ms, and we do around 30 per cycle. Our cycle is currently about 100 ms, so there are significant savings to be had.
I am therefore looking at implementing a RAM cache. A hit should be significantly faster than 1 ms, since the microcontroller core is running at 8 MHz.
The lookup involves a 16-bit address returning 16-bit data. The microcontroller is 32-bit.
Any input on caching would be greatly appreciated, especially if I am totally missing the mark and should be using something else, like a linked list, or even a pre-existing library.
Here is what I think I am trying to achieve:
- A cache made up of an array of structs. The struct would contain the address, the data, and some sort of counter indicating how often this piece of data has been accessed (readCount).
- The array would normally be sorted by address. I would have an efficient lookup() function to look up an address and get the data (suggestions?).
- On a cache miss, I would sort the array by readCount to determine the least-used cached value and throw it away. I would then fill its position with the new value looked up from EEPROM, and reorder the array by address. Any sorting would use an efficient sort (shell sort? - not sure how to handle this with arrays).
- I would somehow decrement all of the readCount variables so that they tend to zero if not used. This should preserve constantly used variables.
Here are my thoughts so far (pseudocode, apologies for my coding style):
#define CACHE_SIZE 50

//one piece of data in the cache
struct cacheItem
{
    uint16_t address;
    uint16_t data;
    uint8_t readCount;
};

//array of cached addresses
struct cacheItem cache[CACHE_SIZE];

//function to get data from the cache
uint16_t getDataFromCache(uint16_t address)
{
    uint16_t data;
    uint8_t cacheResult;
    struct cacheItem *cacheHit; //Pointer to a successful cache hit

    //returns CACHE_HIT if in the cache, else returns CACHE_MISS
    cacheResult = lookUpCache(address, &cacheHit);

    if (cacheResult == CACHE_MISS)
    {
        //Think this is necessary to easily weed out the least accessed address
        sortCacheByReadCount(); //shell sort?

        removeLastCacheEntry(); //delete the last item that hasn't been accessed for a while

        data = getDataFromEEPROM(address); //Expensive EEPROM read

        //Add on to the bottom of the cache
        appendToCache(address, data, 1); //1 = setting readCount to 1 for new addition

        //Think this is necessary to make a lookup function faster
        sortCacheByAddress(); //shell sort?
    }
    else
    {
        data = cacheHit->data; //We had a hit, so pull the data
        cacheHit->readCount++; //Up the importance now
    }
    return data;
}

//Main function
main(void)
{
    testData = getDataFromCache(1234);
}
Am I going down the completely wrong track here? Any input is appreciated.
Repeated sorting sounds expensive to me. I would implement the cache as a hash table on the address. To keep things simple, I would start by not even counting hits but rather evicting old entries immediately on seeing a hash collision:
#define CACHE_SIZE 32  // power of two

struct CacheEntry {
    int16_t address;
    int16_t value;
};

struct CacheEntry cache[CACHE_SIZE];

// adjust shifts for different CACHE_SIZE
static inline int cacheIndex(int adr) {
    return ((adr >> 10) + (adr >> 5) + adr) & (CACHE_SIZE - 1);
}

int16_t cachedRead(int16_t address)
{
    int idx = cacheIndex(address);
    struct CacheEntry *pCache = cache + idx;
    if (address != pCache->address) {
        pCache->value = readEeprom(address);
        pCache->address = address;
    }
    return pCache->value;
}
If this proves not effective enough, I would start by fiddling around with the hash function.
Don't be afraid to do more computations, in most cases I/O is slower.
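One caveat worth adding (my note, not part of the answer above): the global cache array starts zero-filled, so a lookup for address 0 would falsely hit stale zeros. An init pass fixes that, assuming 0xFFFF is never a valid EEPROM address:
void cacheInit(void)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        cache[i].address = -1;  /* 0xFFFF: assumed-invalid address, forces a miss */
}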
This is the simplest implementation I can think of:
#define CACHE_SIZE 50

something cached_vals[CACHE_SIZE];
short int cached_item_num[CACHE_SIZE];
unsigned char cache_hits[CACHE_SIZE]; // 0 means free.

void inc_hits(int index) {
    if (cache_hits[index] > 127) {  // about to overflow: decay all counts
        for (int i = 0; i < CACHE_SIZE; i++)
            if (cache_hits[i])
                cache_hits[i] = (cache_hits[i] >> 1) + 1; // +1 keeps used slots
                                                          // nonzero: 0 is the
                                                          // "free" marker
    }
    cache_hits[index]++;
}

int get_new_space(short int item) {
    for (int i = 0; i < CACHE_SIZE; i++)
        if (!cache_hits[i]) {  // free slot found
            cached_item_num[i] = item;
            inc_hits(i);
            return i;
        }
    // no free slots: drop the one with the lowest hit count
    int min_idx = 0;
    for (int i = 1; i < CACHE_SIZE; i++)
        if (cache_hits[i] < cache_hits[min_idx])
            min_idx = i;
    cache_hits[min_idx] = 2; // just to give new values more chances to "survive"
    cached_item_num[min_idx] = item;
    return min_idx;
}

something *get_item(short int item) {
    for (int i = 0; i < CACHE_SIZE; i++) {
        if (cache_hits[i] && cached_item_num[i] == item) {
            inc_hits(i);
            return cached_vals + i;
        }
    }
    int new_item = get_new_space(item);
    read_from_eeprom(item, cached_vals + new_item);
    return cached_vals + new_item;
}
Sorting and moving data seems like a bad idea, and it's not clear you gain anything useful from it.
I'd suggest a much simpler approach. Allocate 4*N (for some N) bytes of data, as an array of 4-byte structs each containing an address and the data. To look up a value at address A, you look at the struct at index A mod N; if its stored address is the one you want, then use the associated data, otherwise look up the data off the EEPROM and store it there along with address A. Simple, easy to implement, easy to test, and easy to understand and debug later.
If the location of your current lookup tends to be near the location of previous lookups, that should work quite well -- any time you're evicting data, it's going to be from at least N locations away in the table, which means you're probably not likely to want it again any time soon -- I'd guess that's at least as good a heuristic as "how many times did I recently use this". (If your EEPROM is storing several different tables of data, you could probably just do a cache for each one as the simplest way to avoid collisions there.)
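A minimal sketch of that scheme, reusing getDataFromEEPROM() from the question's pseudocode (the other names are mine):
#include <stdint.h>

#define N 64  /* number of 4-byte cache slots */

struct slot {
    uint16_t address;
    uint16_t data;
};

static struct slot table[N];

uint16_t getDataFromEEPROM(uint16_t address);  /* the slow read from the question */

uint16_t lookup(uint16_t a)
{
    struct slot *s = &table[a % N];  /* index A mod N */
    if (s->address != a) {           /* miss: fetch and overwrite in place */
        s->data = getDataFromEEPROM(a);
        s->address = a;
    }
    return s->data;
}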
You said that which entry you need from the table relates to the temperature, and that the temperature tends to remain stable. As long as the temperature does not change too quickly, it is unlikely that you will need an entry from the table which is more than 1 entry away from the previously needed entry.
You should be able to accomplish your goal by keeping just 3 entries in RAM. The first entry is the one you just used. The next is the one corresponding to the temperature just below the last temperature measurement, and the other is the one just above it. When the temperature changes, one of these entries probably becomes the new current one. You can then perform whatever task you need using this data, and go ahead and read the entry you now need (higher or lower than the current temperature) after you have finished the other work (before reading the next temperature measurement).
Since there are only 3 entries in RAM at a time, you don't have to be clever about what data structure to store them in to access them efficiently, or even keep them sorted, because it will never be that long.
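A hypothetical sketch of that three-entry scheme (readTableEntry() stands in for the slow EEPROM fetch; bounds checks at the table ends are omitted, and for simplicity it reads the new neighbor immediately rather than deferring the read as suggested above):
static uint16_t curIdx;                 /* table index of the current entry */
static uint16_t below, cur, above;      /* the three cached table entries */

uint16_t readTableEntry(uint16_t idx);  /* the ~1 ms EEPROM read */

static void refill(uint16_t idx)        /* temperature jumped: reload all three */
{
    curIdx = idx;
    cur    = readTableEntry(idx);
    below  = readTableEntry(idx - 1);
    above  = readTableEntry(idx + 1);
}

uint16_t entryFor(uint16_t idx)
{
    if (idx == curIdx) {                /* stable temperature: no EEPROM read */
        return cur;
    } else if (idx == curIdx - 1) {     /* drifted down one step */
        above = cur; cur = below; curIdx = idx;
        below = readTableEntry(idx - 1);
    } else if (idx == curIdx + 1) {     /* drifted up one step */
        below = cur; cur = above; curIdx = idx;
        above = readTableEntry(idx + 1);
    } else {
        refill(idx);
    }
    return cur;
}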
If temperatures can move faster than 1 unit per examination period, you could just increase the size of your cache and maybe keep a few more anticipatory entries (in the direction the temperature seems to be heading) than trailing entries. Then you may want to store the entries in an efficient structure, though. I wouldn't worry about how recently you accessed an entry, because next-temperature predictions based on the current temperature will usually be pretty good. You will need to handle the case where you are way off and must immediately read the entry for a temperature you just measured, though.
Here are my suggestions:
Replace-oldest or replace-least-recently-used would be a better policy: replacing the least accessed entry would quickly fill up the cache and then just repeatedly replace the last element.
Do not traverse the whole array; probe a few pseudo-random locations (seeded by the address) instead. (The special case of a single location is already presented by ruslik.)
My idea would be:
#define CACHE_SIZE 50

//one piece of data in the cache
struct cacheItem
{
    uint16_t address;
    uint16_t data;
    uint8_t whenWritten;
};

//array of cached addresses
struct cacheItem cache[CACHE_SIZE];

// circular cache write counter
uint8_t writecount = 0;

// returns a cache location that either contains the actual data
// or is the best candidate to be rewritten
struct cacheItem *cacheLocation(uint16_t address) {
    struct cacheItem *bestc = NULL, *c;
    int bestage = -1, age, i;
    srand(address); // I'll use the standard PRNG to pick locations; seeded
                    // with the address, it always gives the same sequence
                    // for the same address
    for (i = 0; i < 4; i++) { // any number of probes you find works best
        c = &cache[rand() % CACHE_SIZE];
        if (c->address == address) return c; // FOUND!
        age = (writecount - c->whenWritten) & 0xFF; // after age 255 comes age 0 :(
        if (age > bestage) {
            bestage = age;
            bestc = c;
        }
    }
    return bestc; // best eviction candidate
}

....

struct cacheItem *c = cacheLocation(addr);
if (c->address != addr) {
    c->address = addr;
    c->data = external_read(addr);
    c->whenWritten = ++writecount;
}
The cache age will wrap from 255 back to 0, but that just slightly randomizes cache replacements, so I did not add a workaround.
