Size of a cache and calculating the cache set - C

I'm trying to understand cache basics.
If I have
#define OFFSET_BITS (6) // 64 bytes cache line
#define SET_INDEX_BITS (5) // 32 sets
#define TAG_BITS (64 - OFFSET_BITS - SET_INDEX_BITS) //
#define NWAYS (8) // 8 ways cache.
What is the size of cache in this machine?
Is it just adding the offset, set and tag bits?
Also, let's say I have the address 0x40000100. What is the cache set for that address, and how do I calculate it?

Assume you have an array, like this:
uint8_t myCache[1 << SET_INDEX_BITS][NWAYS][1 << OFFSET_BITS];
For NWAYS = 8, SET_INDEX_BITS = 5 and OFFSET_BITS = 6; the size of the array (the size of the cache) would be 16384 bytes (16 KiB).
Note that "cache size" only refers to how much data the cache can store, and excludes the cost of storing the tags needed to find the data.
The tags could be represented by a second array, like this:
uint64_t myCacheTags[1 << SET_INDEX_BITS][NWAYS];
If one tag costs 53 bits, then 256 tags will cost 13568 bits; so to actually implement the cache you'd need a minimum of 18080 bytes. Of course in C (where you can't have an array of 53-bit integers) it'd cost a little more for padding/alignment (the array of tags would end up costing 64 bits per tag instead).
To find a cache line in the cache you'd do something like:
uint8_t *getCacheLine(uint32_t address) {
    int setIndex = (address >> OFFSET_BITS) & ((1 << SET_INDEX_BITS) - 1);
    int myTag = address >> (OFFSET_BITS + SET_INDEX_BITS);

    for (int way = 0; way < NWAYS; way++) {
        if (myTag == myCacheTags[setIndex][way]) {
            return myCache[setIndex][way];
        }
    }
    return NULL; // Cache miss
}
Note: Typically the tag contains some kind of "valid or invalid" flag (in case an entry in the cache contains nothing at all), and typically the tag also contains something to represent how recently used the cache line is (for some kind of "least recently used" eviction algorithm). The example code I've provided is incomplete: it doesn't mask off these extra bits when doing if(myTag == myCacheTags[setIndex][way]), it doesn't check any valid/invalid flag, and it doesn't update the tag to indicate that the cache line was recently used.

Reading CAN message (PCAN-Router Pro FD)

I have a problem writing code that reads a CAN message, edits it (limits it to some maximum value) and then sends it back with the same ID.
I'm using a PCAN-Router Pro FD and will show you their example of such a thing, which is basically the same as mine, but I have no idea what some of the numbers or operations are. [1]: https://i.stack.imgur.com/6ZDHn.jpg
My task is to: 1) Read a CAN message with these parameters (ID = 0x120, start bit 8, length 8 bits and factor 0.75)
2) Limit the value to 100 (because the message should carry info about coolant temperature.)
3) If the value is below 100, don't change anything. If it is higher, change it to 100.
Thanks for any help!
Original code:
// catch ID 180h and limit a signal to a maximum
else if (RxMsg.id == 0x180 && RxMsg.msgtype == CAN_MSGTYPE_STANDARD)
{
    uint32_t speed;

    // get the signal (Intel format)
    speed = (RxMsg.data32[0] >> 12) & 0x1FFF;

    // limit value
    if (speed > 6200)
    {
        speed = 6200;
    }

    // replace the original value
    RxMsg.data32[0] &= ~(0x1FFF << 12);
    RxMsg.data32[0] |= speed << 12;
}
After consulting the matter in person, we have found the answer.
The structure type of RxMsg contains a union allowing the data to be accessed in 4-byte chunks (RxMsg.data32), 2-byte chunks (RxMsg.data16), or 1-byte chunks (RxMsg.data8). Since the temperature is located at bit 8 and is 1 byte long, it can be accessed without using any binary masks, bit shifts or bitwise-logical-assignment operators at all.
// more if-else statements...
else if (RxMsg.id == 0x120 && RxMsg.msgtype == CAN_MSGTYPE_STANDARD)
{
    uint8_t temperature = RxMsg.data8[1];
    float factor = 0.75;

    if (temperature * factor > 100.0)
    {
        temperature = (int)(100 / factor);
    }
    RxMsg.data8[1] = temperature;
}
The answer assumes that the start bit is the most significant bit in the message buffer and that the temperature value must be scaled down by the mentioned factor. Should the start bit mean the least significant bit, the [1] index could just be swapped out for [62], as the message buffer contains 64 bytes in total.
The question author was not provided with a reference sheet for data format, so the answer is based purely on the information mentioned in the question. The temperature scaling factor is yet to be tested (will edit this after confirming it works).

What's a performant and clean way to parse a binary file in C?

I'm parsing a custom binary file structure for which I know the format.
The general idea is that each file is broken up into blocks of sequential bytes, which I want to separate and decode in parallel.
I'm looking for a readable, performant alternative to decode_block()
Here's what I'm currently working with:
#include <stdio.h>
#include <stdint.h>

int decode_block(uint8_t buffer[]);

int main(void) {
    FILE *ptr = fopen("example_file.bin", "rb");
    if (!ptr) {
        printf("can't open.\n");
        return 1;
    }

    int block1_size = 2404;
    uint8_t block1_buffer[block1_size];
    fread(block1_buffer, sizeof(uint8_t), block1_size, ptr);

    int block2_size = 3422;
    uint8_t block2_buffer[block2_size];
    fread(block2_buffer, sizeof(uint8_t), block2_size, ptr);

    fclose(ptr);

    // Do these in parallel
    decode_block(block1_buffer);
    decode_block(block2_buffer);
    return 0;
}

int decode_block(uint8_t buffer[]) {
    unsigned int size_of_block = (buffer[3] << 24) + (buffer[2] << 16) + (buffer[1] << 8) + buffer[0];
    unsigned int format_version = buffer[4];
    unsigned int number_of_values = (buffer[8] << 24) + (buffer[7] << 16) + (buffer[6] << 8) + buffer[5];
    unsigned int first_value = (buffer[10] << 8) + buffer[9];
    // On and on and on

    int ptr = first_value;
    int values[number_of_values];
    for (int i = 0; i < number_of_values; i++) {
        values[i] = (buffer[ptr + 3] << 24) + (buffer[ptr + 2] << 16) + (buffer[ptr + 1] << 8) + buffer[ptr];
        ptr += 4;
    }
    // On and on and on
    return 0;
}
It feels a little redundant to read the entire file into a byte array and then interpret the array byte by byte. It also makes for very bulky code.
But since I need to operate on multiple parts of the file in parallel, I can't think of another way to do this. Also, is there a simpler or faster way to convert the early bytes in buffer into their respective metadata values?
I'd:
use "memory mapped files" to avoid loading the raw data (e.g. mmap() in POSIX systems). Note that this is not portable "plain C", but almost every OS supports a way to do this.
make sure that the file format specification requires that the values are aligned to a 4-byte boundary in the file, and (if you actually do need to support signed integers) that the values are stored in "2's complement" format (and not "sign and magnitude" or anything else)
check that the file complies with the specification as much as possible (not just the alignment requirement, but including things like "data can't start in middle of header", "data start + entries * entry_size can't exceed file's size", "version not recognized", etc).
have different code for little-endian machines (e.g. where the code used may be selected at compile time with an #ifdef), where you can cast the memory mapped file's data to int32_t (or uint32_t). Note that the code you've shown (e.g. (buffer[ptr + 3] << 24) + (buffer[ptr + 2] << 16) + (buffer[ptr + 1] << 8) + buffer[ptr]) is broken for negative numbers (even on "2's complement" machines); so the alternative code (for "not little-endian" cases) will be more complicated (and slower) than yours is. Of course if you don't need to support negative numbers you should not be using any signed integer type (e.g. int), and quite frankly you shouldn't be using "possibly 16 bit" int for 32-bit values anyway.
determine how many threads you should use (maybe command line argument; maybe by asking OS how many CPUs the computer actually has). Start the threads and tell them which "thread number" they are (where existing thread is number 0, first spawned thread is number 1, etc).
let the threads calculate their starting and ending offset (in the memory mapped file) from their "thread number", a global "total threads", a global "total entries" and a global "offset of first entry". This is mostly just division with special care for rounding. Note that (to avoid global variables) you could pass a structure containing the details to each thread instead. No safeguards (e.g. locks, critical sections) will be needed for this data because threads only read it.
let each thread parse its section of the data in parallel; then wait for them all to finish (e.g. maybe "thread number 0" does "pthread_join()" if you don't want to keep the threads for later use).
You will probably also need to check that all values (parsed by all threads) are within an allowed range (to comply with the file format specification); and have some kind of error handling for when they don't (e.g. when the file is corrupt or has been maliciously tampered with). This could be as simple as a (global, atomically incremented) "number of dodgy values found so far" counter; which could allow you to display an "N dodgy values found" error message after all values are parsed.
Note 1: If you don't want to use a memory mapped file (or can't), you can have one "file reader thread" and multiple "file parser threads". This takes a lot more synchronization (it devolves into a FIFO queue with flow control, e.g. with the producer thread doing some kind of "while queue full { wait }" and consumer threads doing some kind of "while queue empty { wait }"). This extra synchronization will increase overhead and make it slower (in addition to being more complex) compared to using memory mapped files.
Note 2: If the file's data isn't cached by the operating system's file data cache, then you'll probably be bottlenecked by file IO regardless of what you do and using multiple threads probably won't help performance for that case.

Micro-optimizing a linear search loop over a huge array with OpenMP: can't break on a hit

I have a loop that takes between 90% and 99% of the program time approximately. It reads a huge LUT, and this loop is executed > 100,000 times, so it deserves some optimization.
EDIT:
The LUT (actually there are various arrays that compose the LUT) is made of arrays of ptrdiff_t and of unsigned __int128. They have to be that wide because of the algorithm (especially the 128 bit ones). T_RDY is the only bool array.
EDIT:
The LUT stores past combinations used to try to solve a problem that didn't work. There's no relation between them (that I can see yet), so I don't see a more appropriate search pattern.
The single threaded version of the loop is:
k = false;
for (ptrdiff_t i = 0; i < T_IND; i++) {
    if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN)) {
        k = true;
        break;
    }
}
With this code, which makes use of OpenMP, I reduced the time between 2x and 3x in a 4 core processor:
k = false;
#pragma omp parallel for shared(k)
for (ptrdiff_t i = 0; i < T_IND; i++) {
    if (k)
        continue;
    if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN))
        k = true;
}
EDIT:
Info about the data used:
#define DIM_MAX 128
#define P_LEN prb_lvl[0]
#define P_LVL prb_lvl[1]
#define M_RWS prb_mtx_rws[prb_lvl[1]]
#define T_RWS prb_tab
#define T_NUM prb_tab_num
#define T_RDY prb_tab_rdy
#define T_IND prb_tab_ind
extern ptrdiff_t prb_lvl [2];
extern uint128_t prb_mtx_rws [DIM_MAX];
extern uint128_t prb_tab [10000000];
extern ptrdiff_t prb_tab_num [10000000];
extern bool prb_tab_rdy [10000000];
extern ptrdiff_t prb_tab_ind;
However, the fact that I don't get an improvement of approx. 4x means that it introduces an overhead, which I guess is between 1.5x and 2x. Part of the overhead is unavoidable (creating and destroying the threads), but there's some new overhead due to the facts that OpenMP doesn't allow breaking out of a parallel loop and that I added an if to each iteration, and I would like to get rid of it if possible.
Is there any other optimization that I could apply? Maybe using pthreads instead.
Should I bother editing some assembly?
I'm using GCC 9 with -O3 -flto (among others).
EDIT:
CPU: i7-5775C
But I plan to use other x64 CPUs with more cores.
You can coalesce k into bit tables and then do comparisons 64 at a time. If an entry in the main tables changes, recompute that bit in the bit table.
If different queries use different M_RWS or P_LVL or something, then you'd need separate caches for separate search inputs. Or rebuild the cache for their current values, if you do multiple queries between changes. But hopefully that's not the case, otherwise the all-caps names are misleading.
Set up k as a bit table
#define KSZ (10000000/64 + !!(10000000 % 64))
static uint64_t k[KSZ];

void init_k(void) {
    // We can split this up to minimize cache misses, see below
    for (size_t i = 0; i < 10000000; ++i)
        k[i/64] |= (uint64_t)((!!T_RDY[i]) & (!(~T_RWS[i] & M_RWS)) & ((T_NUM[i] + P_LVL) <= P_LEN)) << (i & 63);
}
You can find the bit-index into k by searching for a non-zero 64-bit chunk, then using a bitscan to find the bit within that chunk:
size_t k2index(void) {
    for (size_t i = 0; i < KSZ; ++i)
        if (k[i])
            return 64 * i + __builtin_ctzll(k[i]);
    return (size_t)-1; // no bit set anywhere
}
You may want to split up your data reads so that you get sequential data access (each table is over 40-80 MB as described) and you don't get a cache miss on every single iteration.
#define KSZ (10000000/64 + !!(10000000 % 64))
static uint64_t k[KSZ], k0[KSZ], k1[KSZ]; // use calloc instead?

void init_k(void) {
    // I split these up to minimize cache misses
    for (size_t i = 0; i < 10000000; ++i)
        k[i/64] |= (uint64_t)(!!T_RDY[i]) << (i & 63);
    for (size_t i = 0; i < 10000000; ++i)
        k0[i/64] |= (uint64_t)(!(~T_RWS[i] & M_RWS)) << (i & 63);
    for (size_t i = 0; i < 10000000; ++i)
        k1[i/64] |= (uint64_t)((T_NUM[i] + P_LVL) <= P_LEN) << (i & 63);

    // now combine them 64 bits at a time
    for (size_t i = 0; i < KSZ; ++i)
        k[i] &= k0[i];
    for (size_t i = 0; i < KSZ; ++i)
        k[i] &= k1[i];
}
If you split it up like this, you could also initialize (some of) them when you set up your other tables. Or if the tables are updated, you could update the k value as well.

How can I deal with a given situation related to a hardware change

I am maintaining production code for an FPGA device. Earlier, registers on the FPGA were 32 bits wide, and reads and writes to these registers worked fine. But the hardware changed, and so did the FPGA device; with the latest version of the FPGA device we have trouble reading and writing FPGA registers. After some R&D we learned that the FPGA registers are no longer 32 bits: they are now 31-bit registers, and the same has been confirmed by the FPGA device vendor.
So a small code change is needed as well. Earlier we checked whether register addresses were 4-byte aligned (because registers were 32 bits); in the current scenario we have to check that addresses are valid 31-bit addresses. For that, we are going to check
if the most significant bit of the address is set (which means it is not a valid 31-bit address).
I guess we are OK here.
Now the second scenario is a bit tricky for me.
If a read/write of multiple registers is going to go past the 0x7fff-fffc boundary (which is the maximum address in the 31-bit scheme), the request has to be handled carefully.
Reading and writing multiple registers takes a length argument, which is simply the number of registers to be read or written.
For example, if a read starts at 0x7fff-fff8 and the length of the read is 5, then we can actually only read 2 registers (0x7fff-fff8 and 0x7fff-fffc).
Now could somebody suggest some kind of pseudo code to handle this scenario?
Something like below:
while (length > 1)
{
    if (!(address << (length * 31) <= 0x7ffffffc))
    {
        length--;
    }
}
I know it is not good enough but something in same line which I can use.
EDIT
I have come up with a piece of code which may fulfill my requirement
int count;
Index_addr = addr;
while (Index_addr <= 0x7ffffffc)
{
    /* Want to move the register address to the next register address; each
       register is 31 bits wide and they are at consecutive locations, like
       0x0, 0x4 and 0x8 etc. */
    Index_addr = addr << 1; // Guess I am doing wrong here; would anyone correct it?
    count++;
}
length = count;
The root problem seems to be that the program is not treating the FPGA registers properly.
Data encapsulation would help; instead of treating the 31-bit FPGA registers as memory locations, they should be abstracted.
The FPGA should be treated as a vector (a one-dimensional array) of registers.
The vector of N FPGA registers should be addressable by a register index in the range 0x0000 through N-1.
The FPGA registers are memory mapped at base addr.
So memory address = 4 * FPGA register index + base addr.
Access to the FPGA registers should be encapsulated by read and write procedures:
int read_fpga_reg(int reg_index, uint32_t *reg_valp)
{
    if (reg_index < 0 || reg_index >= MAX_REG_INDEX)
        return -1; /* error return */
    *reg_valp = *(uint32_t *)((reg_index << 2) + fpga_base_addr);
    return 0;
}
As long as MAX_REG_INDEX and fpga_base_addr are properly defined, then this code will never generate an invalid memory access.
I'm not absolutely sure I'm interpreting the given scenario correctly. But here's a shot at it:
// Assuming "address" starts 4-byte aligned and is just defined as an integer
uint32_t address; // (assuming a 32-bit unsigned type)
while (length > 0) // length is the number of registers
{
    // READ the 4-byte value at "address"
    // Mask the read value with 0x7FFFFFFF since there are 31 valid bits
    // 32 bits (4 bytes) have been read
    if ((--length > 0) && (address < 0x7ffffffc))
        address += 4;
}

Finding positions of '1's efficiently in a bit array

I'm writing a program that tests a set of wires for open or short circuits. The program, which runs on an AVR, drives a test vector (a walking '1') onto the wires and receives the result back. It compares this resultant vector with the expected data, which is already stored on an SD card or external EEPROM.
Here's an example, assume we have a set of 8 wires all of which are straight through i.e. they have no junctions. So if we drive 0b00000010 we should receive 0b00000010.
Suppose we receive 0b11000010. This implies there is a short circuit between wires 7, 8 and wire 2. I can detect which bits I'm interested in with 0b00000010 ^ 0b11000010 = 0b11000000. This tells me clearly that wires 7 and 8 are at fault, but how do I find the positions of these '1's efficiently in a large bit array? It's easy to do this for just 8 wires using bit masks, but the system I'm developing must handle up to 300 wires (bits). Before I start using macros like the following and testing each bit in an array of 300*300 bits, I wanted to ask here if there is a more elegant solution.
#define BITMASK(b) (1 << ((b) % 8))
#define BITSLOT(b) ((b / 8))
#define BITSET(a, b) ((a)[BITSLOT(b)] |= BITMASK(b))
#define BITCLEAR(a,b) ((a)[BITSLOT(b)] &= ~BITMASK(b))
#define BITTEST(a,b) ((a)[BITSLOT(b)] & BITMASK(b))
#define BITNSLOTS(nb) ((nb + 8 - 1) / 8)
Just to further show how to detect an open circuit. Expected data: 0b00000010, received data: 0b00000000 (the wire isn't pulled high). 0b00000010 ^ 0b00000000 = 0b00000010 - wire 2 is open.
NOTE: I know testing 300 wires is not something the tiny RAM inside an AVR Mega 1281 can handle, that is why I'll split this into groups i.e. test 50 wires, compare, display result and then move forward.
Many architectures provide specific instructions for locating the first set bit in a word, or for counting the number of set bits. Compilers usually provide intrinsics for these operations, so that you don't have to write inline assembly. GCC, for example, provides __builtin_ffs, __builtin_ctz, __builtin_popcount, etc., each of which should map to the appropriate instruction on the target architecture, exploiting bit-level parallelism.
If the target architecture doesn't support these, an efficient software implementation is emitted by the compiler. The naive approach of testing the vector bit by bit in software is not very efficient.
If your compiler doesn't implement these, you can still code your own implementation using a de Bruijn sequence.
How often do you expect faults? If you don't expect them that often, then it seems pointless to optimize the "fault exists" case -- the only part that will really matter for speed is the "no fault" case.
To optimize the no-fault case, simply XOR the actual result with the expected result and use an input ^ expected == 0 test to see whether any bits are set.
You can use a similar strategy to optimize the "few faults" case, if you further expect the number of faults to typically be small when they do exist -- mask the input ^ expected value to get just the first 8 bits, just the second 8 bits, and so on, and compare each of those results to zero. Then, you just need to search for the set bits within the ones that are not equal to zero, which should narrow the search space to something that can be done pretty quickly.
You can use a lookup table. For example, a log-base-2 lookup table of 256 bytes can be used to find the most-significant 1-bit in a byte:
uint8_t bit1 = log2[bit_mask];
where log2 is defined as follows:
uint8_t const log2[] = {
    0,                      /* not used: log2[0] */
    0,                      /* log2[0x01] */
    1, 1,                   /* log2[0x02], log2[0x03] */
    2, 2, 2, 2,             /* log2[0x04],..,log2[0x07] */
    3, 3, 3, 3, 3, 3, 3, 3, /* log2[0x08],..,log2[0x0F] */
    ...
};
On most processors a lookup table like this will go to ROM. But AVR is a Harvard machine, and placing data in code space (ROM) requires a special non-standard extension, which depends on the compiler. For example, the IAR AVR compiler would need the extended keyword __flash. In WinAVR (GNU AVR) you would need to use the PROGMEM attribute, but it's more complex than that, because you would also need to use special macros to read from the program space.
I think there is only one way to do this:
Create an array "outdata". Each item of the array can, for example, correspond to an 8-bit port register.
Send the outdata on the wires.
Read back this data as "indata".
Store the indata in an array mapped exactly like the outdata.
In a loop, XOR each byte of outdata with each byte of indata.
I would strongly recommend inline functions instead of those macros.
Why can't your MCU handle 300 wires?
300/8 = 37.5 bytes, rounded up to 38. It needs to be stored twice, outdata and indata: 38*2 = 76 bytes.
You can't spare 76 bytes of RAM?
I think you're missing the forest for the trees. This seems like a bed-of-nails test. First, test some assumptions:
1) You know which pins should be live for each pin tested/energized.
2) you have a netlist translated for step 1 into a file on sd
If you operate on a byte level as well as bit, it simplifies the issue. If you energize a pin, there is an expected pattern out stored in your file. First find the mismatched bytes; identify mismatched pins in the byte; finally store the energized pin with the faulty pin numbers.
You don't need an array for searching, or results. general idea:
numwires = 300;
numbytes = numwires/8 + ((numwires % 8) ? 1 : 0);
for (unsigned char currbyte = 0; currbyte < numbytes; currbyte++)
{
    unsigned char testbyte = inchar(baseaddr + currbyte);
    unsigned char goodbyte = getgoodbyte(testpin, currbyte /* byte offset */);
    if (testbyte ^ goodbyte) {
        // have a mismatch, report the pins
        for (j = 0, mask = 0x01; j < 8; mask <<= 1, j++) {
            if ((mask & testbyte) != (mask & goodbyte)) // for clarity
                logbadpin(testpin, currbyte*8 + j /* pin/wire value */, mask & testbyte /* bad value */);
        }
    }
}
