I'm writing a real-time DSP processing library.
My intention is to make the input block size configurable while still getting the best possible performance for sample-by-sample processing, that is, a block size of one sample.
I think I have to use the volatile keyword when defining the loop variable, since the data processing will use pointers to inputs/outputs.
This leads me to a question:
Will the gcc compiler optimize this code
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
foo()
}
or
//.h
#define BLOCKSIZE 1
//.c
for (volatile int i=0; i<BLOCKSIZE; i++)
{
foo()
}
to be the same as simply calling the body of the loop:
foo()
?
Thx
I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.
No, that doesn't make any sense. Only the input/output hardware registers themselves should be volatile. Pointers to them should be declared as pointer-to-volatile data, i.e. volatile uint8_t*. There is no need to make the pointer itself volatile, i.e. uint8_t* volatile //wrong.
As things stand now, you force the compiler to create a variable i in memory and increment it there, which will likely block loop-unrolling optimizations.
Trying your code on gcc x86 with -O3, this is exactly what happens. No matter the value of BLOCKSIZE, it still generates the loop because of volatile. If I drop volatile, it completely unrolls the loop up to BLOCKSIZE == 7 and replaces it with a sequence of function calls. Beyond 8 it creates a loop (but keeps the iterator in a register instead of RAM).
x86 example:
for (int i=0; i<5; i++)
{
foo();
}
gives
call foo
call foo
call foo
call foo
call foo
But
for (volatile int i=0; i<5; i++)
{
foo();
}
gives the far less efficient
mov DWORD PTR [rsp+12], 0
mov eax, DWORD PTR [rsp+12]
cmp eax, 4
jg .L2
.L3:
call foo
mov eax, DWORD PTR [rsp+12]
add eax, 1
mov DWORD PTR [rsp+12], eax
mov eax, DWORD PTR [rsp+12]
cmp eax, 4
jle .L3
.L2:
For further study of the correct use of volatile in embedded systems, please see:
How to access a hardware register from firmware?
Using volatile in embedded C development
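To illustrate the point about pointer-to-volatile, here is a minimal sketch of a block routine. The name process_block and the halving "gain" are made up for the example. All it is meant to show is that the data pointers carry the volatile qualifier, while the loop counter is an ordinary variable that the optimizer can keep in a register or unroll.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block processor: reads n samples through a
   pointer-to-volatile, halves them, and writes them out through
   another pointer-to-volatile. */
void process_block(const volatile int16_t *in, volatile int16_t *out, size_t n)
{
    /* Plain (non-volatile) counter: the compiler may unroll this loop. */
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)(in[i] / 2);
    }
}
```

With n known at compile time to be 1, the loop overhead can disappear, while every access to in[] and out[] is still performed.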
Since the loop variable is volatile, the compiler shouldn't optimize it away. The compiler cannot know whether i will be 1 when the condition is evaluated, so it has to keep the loop.
From the compiler's point of view, the loop can run an indeterminate number of times until the condition is satisfied.
If you access hardware registers somewhere, then those should be declared volatile. That makes more sense to the reader, and it also allows the compiler to apply appropriate optimizations where possible.
The volatile keyword tells the compiler that the variable is prone to side effects, i.e. it can be changed by something which is not visible to the compiler.
Because of that, volatile variables have to be read before every use and written back to their permanent storage location after every modification.
In your example the loop cannot be optimized, as the variable i can be changed during the loop (for example, some interrupt routine could set it to zero, so the loop would have to be executed again).
The answer to your question is: If the compiler can determine that every time you enter the loop, it will execute only once, then it can eliminate the loop.
Normally, the optimization phase unrolls loops based on how the iterations relate to one another. This makes your (e.g. indefinite) loop several times bigger in exchange for fewer backward jumps (which normally cause a bubble in the pipeline, depending on the CPU type), but not so big that it loses cache hits. So it is a bit complicated, but the gains are huge. If your loop always executes exactly once, however, that is normally because the test you wrote is always true (a tautology) or always false (a contradiction) and can be eliminated. This makes the backward jump unnecessary, and so there is no loop anymore.
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
foo(); // you missed a semicolon here.
}
In your case, the variable is assigned a value that is never touched again, so the first thing the compiler is going to do is replace all uses of your variable by the literal you assigned to it. (Lacking context, I assume blockSize is a local automatic variable that is not changed anywhere else.) Your code changes into:
for (volatile int i=0; i<1; i++)
{
foo();
}
Next, volatile is not necessary, since the variable's scope is the loop, whose body does not use it, so the loop can be replaced by code like the following:
do {
foo();
} while (0);
And this code can in turn be replaced by:
foo();
The compiler analyses the graph of dependencies between data and variables. When a variable is not needed anymore (it is not used later in the program, or it goes out of scope), assigning a value to it is unnecessary, so that code is eliminated. If you compile a for loop that counts from 1 to 2^64 and then stops, and you enable optimization, you will see your loop being thrown away, and you may get the false idea that your processor is capable of counting from 1 to 2^64 in less than a second. That is not true: 2^64 is still far too big a number to count to in less than a second. That loop is not a fixed single-pass loop like yours, but the calculations done in it are of no use, so the compiler eliminates them.
Just test the following program (in this case it is not a test of a single-pass loop, but of 2^64-1 iterations):
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
uint64_t low = 0UL;
uint64_t high = ~0UL;
uint64_t data = 0; // this data is updated in the loop body.
printf("counting from %lu to %lu\n", low, high);
alarm(10); /* security break after 10 seconds */
for (uint64_t i = low; i < high; i++) {
#if 0
printf("data = %lu\n", data = i ); // either here...
#else
data = i; // or here...
#endif
}
return 0;
}
(You can change the #if 0 to #if 1 to see how the optimizer doesn't eliminate the loop when you need to print the results, but you see that the program is essentially the same, except for the call to printf with the result of the assignment)
Just compile it with/without optimization:
$ cc -O0 pru.c -o pru_noopt
$ cc -O2 pru.c -o pru_optim
and then run it under time:
$ time pru_noopt
counting from 0 to 18446744073709551615
Alarm clock
real 0m10,005s
user 0m9,848s
sys 0m0,000s
while running the optimized version gives:
$ time pru_optim
counting from 0 to 18446744073709551615
real 0m0,002s
user 0m0,002s
sys 0m0,002s
(Impossible: not even the best computer can count one by one up to that number in less than 2 milliseconds.) So the loop must have gone away. You can check this in the assembler code. As the updated value of data is not used after the assignment, the loop body can be eliminated, and so can the 2^64-1 executions of it.
Now add the following line after the loop:
printf("data = %lu\n", data);
You will then see that, even with the -O3 option, the loop is left untouched, because the value resulting from all the assignments is used after the loop.
(I preferred not to show the assembler code, and remain in high level, but you can have a look at the assembler code and see the actual generated code)
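If the goal is to make sure every iteration of such a test loop really runs, one option is a store to a volatile object inside the body, since each volatile access counts as observable behaviour. A minimal sketch, where count_with_sink and sink are hypothetical names and the loop counter itself stays non-volatile:

```c
#include <stdint.h>

/* Hypothetical sink: every store to it has to be performed. */
volatile uint64_t sink;

uint64_t count_with_sink(uint64_t low, uint64_t high)
{
    uint64_t data = 0;
    for (uint64_t i = low; i < high; i++) {
        data = i;
        sink = data;   /* one volatile store per iteration */
    }
    return data;       /* last value assigned, or 0 if the loop never ran */
}
```

Calling it with the full 64-bit range would of course take far too long, so try it with a small range first.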
I have an OS project that I am working on and I am trying to call data that I have read from the disk in C with inline assembly.
I have already tried reading the code and executing it with the assembly call instruction, using inline assembly.
void driveLoop() {
uint16_t sectors = 31;
uint16_t sector = 0;
uint16_t basesector = 40000;
uint32_t i = 40031;
uint16_t code[sectors][256];
int x = 0;
while(x==0) {
read(i);
for (int p=0; p < 256; p++) {
if (readOut[p] == 0) {
} else {
x = 1;
//kprint_int(i);
}
}
i++;
}
kprint("Found sector!\n");
kprint("Loading OS into memory...\n");
for (sector=0; sector<sectors; sector++) {
read(basesector+sector);
for (int p=0; p<256; p++) {
code[sector][p] = readOut[p];
}
}
kprint("Done loading.\n");
kprint("Attempting to call...\n");
asm volatile("call (%0)" : : "r" (&code));
When the inline assembly is called I expect it to run the code from the sectors I read from the "disk" (this is in a VM, because it's a hobby OS). What it does instead is it just hangs.
I probably don't much understand how variables, arrays, and assembly work, so if you could fill me in, that would be nice.
EDIT: The data I am reading from the disk is a binary file that was added to the disk image file with
cat kernel.bin >> disk.img
and the kernel.bin is compiled with
i686-elf-ld -o kernel.bin -Ttext 0x4C4B40 *insert .o files here* --oformat binary
What it does instead is it just hangs.
Run your OS inside BOCHS so you can use BOCHS's built-in debugger to see exactly where it's stuck.
Being able to debug lockups, including with interrupts disabled, is probably very useful...
asm volatile("call (%0)" : : "r" (&code)); is unsafe because of missing clobbers.
But even worse than that it will load a new EIP value from the first 4 bytes of the array, instead of setting EIP to that address. (Unless the data you're loading is an array of pointers, not actual machine code?)
You have the %0 in parentheses, so it's an addressing mode. The assembler will warn you about an indirect call without *, but will assemble it like call *(%eax), with EAX = the address of code[0][0]. You actually want a call *%eax or whatever register the compiler chooses, register-indirect not memory-indirect.
&code and code are both just a pointer to the start of the array; &code doesn't create an anonymous pointer object storing the address of another address. &code takes the address of the array as a whole. code in this context "decays" to a pointer to the first object.
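A small sketch of that point, using a file-scope array with the same shape as the one in the question (the helper names are hypothetical). The three expressions yield the same address and differ only in type, which shows up in what sizeof reports for the pointed-to object:

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t code[31][256];

/* Returns 1 if &code, code (after decay) and &code[0][0] are the same address. */
int same_start_address(void)
{
    return (void *)&code == (void *)code
        && (void *)code == (void *)&code[0][0];
}

/* &code points to the whole array: 31 * 256 elements. */
size_t whole_array_size(void) { return sizeof *(&code); }

/* code decays to a pointer to the first row: 256 elements. */
size_t first_row_size(void) { return sizeof *code; }
```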
https://gcc.gnu.org/wiki/DontUseInlineAsm (for this).
You can get the compiler to emit a call instruction by casting the pointer to a function pointer.
__builtin___clear_cache(&code[0][0], &code[30][255]); // don't optimize away stores into the buffer
void (*fptr)(void) = (void*)code; // casting to void* instead of the actual target type is simpler
fptr();
That will compile (with optimization enabled) to something like lea 16(%esp), %eax / call *%eax, for 32-bit x86, because your code[][] buffer is an array on the stack.
Or to have it emit a jmp instead, do it at the end of a void function, or return funcptr(); in a non-void function, so the compiler can optimize the call/ret into a jmp tailcall.
If it doesn't return, you can declare it with __attribute__((noreturn)).
Make sure the memory page / segment is executable. (Your uint16_t code[]; is a local, so gcc will allocate it on the stack. This might not be what you want. The size is a compile-time constant so you could make it static, but if you do that for other arrays in other sibling functions (not parent or child), then you lose out on the ability to reuse a big chunk of stack memory for different arrays.)
This is much better than your unsafe inline asm. (You forgot a "memory" clobber, so nothing tells the compiler that your asm actually reads the pointed-to memory). Also, you forgot to declare any register clobbers; presumably the block of code you loaded will have clobbered some registers if it returns, unless it's written to save/restore everything.
In GNU C you do need to use __builtin___clear_cache when casting a data pointer to a function pointer. On x86 it doesn't actually clear any cache; it tells the compiler that the stores to that memory are not dead, because the memory is going to be read by execution. See How does __builtin___clear_cache work?
Without that, gcc could optimize away the copying into uint16_t code[sectors][256]; because it looks like a dead store. (Just like with your current inline asm which only asks for the pointer in a register.)
As a bonus, this part of your OS becomes portable to other architectures, including ones like ARM without coherent instruction caches, where that builtin expands to actual instructions. (On x86 it purely affects the optimizer.)
read(basesector+sector);
It would probably be a good idea for your read function to take a destination pointer to read into, so you don't need to bounce data through your readOut buffer.
Also, I don't see why you'd want to declare your code as a 2D array; sectors are an artifact of how you're doing your disk I/O, not relevant to using the code after it's loaded. The sector-at-a-time thing should only be in the code for the loop that loads the data, not visible in other parts of your program.
char code[sectors * 512]; would be good.
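Putting the last two suggestions together, a loader could look roughly like the sketch below. Everything here is an assumption for illustration: read_sector stands in for a disk driver that takes a destination pointer, fake_disk is a tiny in-memory "disk" so that the example is self-contained, and the 512-byte sector size matches the flat buffer suggested above.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u

/* Hypothetical stand-in for the real disk: four sectors in memory. */
static uint8_t fake_disk[4 * SECTOR_SIZE];

/* Hypothetical driver call that reads one sector straight into dst. */
static void read_sector(uint32_t lba, void *dst)
{
    memcpy(dst, &fake_disk[lba * SECTOR_SIZE], SECTOR_SIZE);
}

/* Loads count sectors starting at first_lba into one flat buffer.
   Only this loop knows that the data arrives a sector at a time. */
void load_image(uint32_t first_lba, uint32_t count, uint8_t *dst)
{
    for (uint32_t s = 0; s < count; s++) {
        read_sector(first_lba + s, dst + (size_t)s * SECTOR_SIZE);
    }
}
```

The caller then only sees a flat byte buffer, with no intermediate readOut copy and no 2D indexing.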
Let's say I have a lookup table, an array of 256 elements defined and declared in a header named lut.h. The array will be accessed multiple times during the lifetime of the program.
From my understanding, if it's defined and declared as static, it will remain in memory until the program is done, i.e. if it is a task running on a uC, the array is in memory the entire time.
Whereas without static, it will be loaded into memory when accessed.
In lut.h
static const float array[256] = {1.342, 14.21, 42.312, ...};
vs.
const float array[256] = {1.342, 14.21, 42.312, ...};
Considering the uC has limited spiflash and psram, what would be the most performance oriented approach?
You have some misconceptions here, since an MCU is not a PC. Everything in memory in an MCU will persist for as long as the MCU has power. Programs do not end or return resources to a hosting OS.
"Tasks" on an MCU means you have an RTOS. They use their own stack and that's a topic of its own, quite unrelated to your question. It is normal that all tasks on an RTOS execute forever, rather than getting allocated/deallocated at run time like processes on a PC.
static versus automatic at local scope does mean different RAM use, but not necessarily more/less memory use. Local variables get pushed/popped on the stack as the program executes. static ones sit at their designated address.
Where as without static, it will be loaded into memory when accessed.
Only if the array you are loading into is declared locally. That is:
void func (void)
{
int my_local_array[] = {1,2,3};
...
}
Here my_local_array will load the values from flash to RAM during execution of that function only. This means two things:
The actual copy down from flash to RAM is slow. First of all, copying something is always slow, regardless of the situation. But in the specific case of copying from flash to RAM, it might be extra slow, depending on the MCU.
It will be extra slow on high end MCUs with flash wait states that fail to utilize data cache for the copy. It will be extra slow on weird Harvard architecture MCUs that can't address data directly. Etc.
So naturally if you do this copy down each time a function is called, instead of just once, your program will turn much slower.
Large local objects lead to a need for higher stack size. The stack must be large enough to deal with the worst-case scenario. If you have large local objects, the stack size will need to be set much higher to prevent stack overflows. Meaning this can actually lead to less effective memory use.
So it isn't trivial to tell if you save or lose memory by making an object local.
General good practice design in embedded systems programming is to not allocate large objects on the stack, as they make stack handling much more detailed and the potential for stack overflow increases. Such objects should be declared as static, at file scope. Particularly if speed is important.
static const float array vs const float array
Another misconception here. Making something const in an MCU system, while at the same time placing it at file scope ("global"), most likely means that the variable will end up in flash ROM, not in RAM. Regardless of static.
This is most of the time preferred, since in general RAM is a more valuable resource than flash. The role static plays here is merely good program design, as it limits access to the variable to the local translation unit, rather than cluttering up the global namespace.
In lut.h
You should never define variables in header files.
It is bad from a program design point of view, as you expose the variable all over the place ("spaghetti programming"), and it is bad from a linker point of view if multiple source files include the same header file, which is extremely likely.
Correctly designed programs place the variable in the .c file and limit access by declaring it static. Access from the outside, if needed, is done through setters/getters.
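A minimal sketch of that layout, as it might appear in a lut.c file. The names lut, LUT_SIZE and lut_get are hypothetical, and the table is cut down to the three values visible in the question; the real one would have 256 entries.

```c
#include <stddef.h>

#define LUT_SIZE 3u

/* Private to this translation unit. On a typical MCU toolchain a const
   object at file scope is expected to be placed in flash, not RAM. */
static const float lut[LUT_SIZE] = { 1.342f, 14.21f, 42.312f };

/* Getter: the only way other files can reach the table.
   Out-of-range indices yield 0.0f in this sketch. */
float lut_get(size_t index)
{
    return (index < LUT_SIZE) ? lut[index] : 0.0f;
}
```

The matching lut.h would then only contain the prototype of lut_get, not the array itself.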
the uC has limited spiflash
What is "spiflash"? An external serial flash memory accessed through SPI? Then none of this makes sense, since such flash memory isn't memory-mapped and typically the compiler can't utilize it. Access to such memories has to be carried out by your application, manually.
If your arrays are defined on a file level (you mentioned lut.h), and both have const qualifiers, they will not be loaded into RAM¹. The static keyword only limits the scope of the array, it doesn't change its lifetime in any way. If you check the assembly for your code, you will see that both arrays look exactly the same when compiled:
static const int static_array[] = { 1, 2, 3 };
const int extern_array[] = { 1, 2, 3};
extern void do_something(const int * a);
int main(void)
{
do_something(static_array);
do_something(extern_array);
return 0;
}
Resulting assembly:
main:
sub rsp, 8
mov edi, OFFSET FLAT:static_array
call do_something
mov edi, OFFSET FLAT:extern_array
call do_something
xor eax, eax
add rsp, 8
ret
extern_array:
.long 1
.long 2
.long 3
static_array:
.long 1
.long 2
.long 3
On the other hand, if you declare the arrays inside a function, then the array will be copied to temporary storage (stack) for the duration of the function, unless you add the static qualifier:
extern void do_something(const int * a);
int main(void)
{
static const int static_local_array[] = { 1, 2, 3 };
const int local_array[] = { 1, 2, 3 };
do_something(static_local_array);
do_something(local_array);
return 0;
}
Resulting assembly:
main:
sub rsp, 24
mov edi, OFFSET FLAT:static_local_array
movabs rax, 8589934593
mov QWORD PTR [rsp+4], rax
mov DWORD PTR [rsp+12], 3
call do_something
lea rdi, [rsp+4]
call do_something
xor eax, eax
add rsp, 24
ret
static_local_array:
.long 1
.long 2
.long 3
¹ More precisely, it depends on the compiler. Some compilers will need additional custom attributes to define exactly where you want to store the data. Some compilers will try to place the array into RAM when there is enough spare space, to allow faster reading.
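One practical consequence of the difference: a static local array has static storage duration, so a pointer to it stays valid after the function returns, which would not be true for the automatic copy on the stack. A small sketch (get_table is a hypothetical name):

```c
/* The table exists once, for the whole run of the program;
   nothing is copied to the stack when the function is called. */
const int *get_table(void)
{
    static const int table[] = { 1, 2, 3 };
    return table;   /* fine: static storage duration */
}
```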
My problem explained:
On my microcontroller (Atmel AT90CAN128) I have about 2500 bytes of RAM left.
In those 2500 bytes I need to store 5 times 100 data sets (the size could change in the future). The data sets have a predefined but varying length between 1 and 9 bytes. The pure data sets occupy about 2000 bytes in total. I now need to be able to access the data sets in an array-like fashion, by passing a uint8 to a function and getting a pointer to the data set in return.
But then I only have about 500 bytes left, so an array with pointers to each data set (calculated at the start of run time) is simply not possible.
My attempt:
I use one big uint8 array[2000] (in RAM), and the lengths of the data sets are stored in flash as const uint8[] = {1, 5, 9, ...};.
The position of a data set in the big array is the accumulated length of the sets before it. So I would have to iterate through the length array, add the values up, and then use the sum as an offset from the start of the big data array.
At runtime this gives me bad performance. The position of the data sets within the big array IS KNOWN at compile time, I just don't know how to put this information into an array that the compiler can store in flash.
As the number of data sets could change, I need a solution that automatically calculates the positions.
Goal:
something like that
uint8 index = 57;
uint8 *pointer_to_data = pointer_array[index];
Is this even possible, as the compiler is a one-pass compiler?
(I am using Codevision, not avr gcc)
My solution
The pure C solution/answer is technically the right answer to my question, but it just seems overly complicated (from my perspective). The idea with the build script seemed better, but Codevision is not very practical in that way.
So I ended up with a bit of a mix.
I wrote a JavaScript script that writes the C code/definition of the variables for me. The raw definitions are easy to edit: I just copy-paste the whole thing into an HTML text file, open it in a browser, and copy-paste the output back into my C file.
In the beginning I was missing a crucial element, and that is the position of the 'flash' keyword in the definition. The following is a simplified output of my script that compiles just the way I like it.
flash uint8 len[150] = {4, 4, 0, 2, ...};
uint8 data1[241] = {0}; //accumulated from above
uint8 * flash pointers_1[150] = {data1 +0, data1 +4, data1 +0, data1 +8, ...};
The ugly part (lots of manual labor without a script) is adding up the lengths for each pointer, as the compiler will only accept the initializer if the pointer is offset by a constant and not by a value stored in a constant array.
The raw definitions that are fed to the javascript then look like this
var strings = [
"len[0] = 4;",
"len[1] = 4;",
"len[3] = 2;",
...
Within the JavaScript it is an array of strings; this way I could copy my old definitions into it and just add some quotes. I only need to define the entries that I want to use: index 2 is not defined, so the script uses length 0 for it but still includes it. The macro would have needed an entry with 0, I guess, which is bad for overview in my case.
It is not a one-click solution, but it is very readable and tidy, which makes up for the copy-paste.
One common method of packing variable-length data sets to a single continuous array is using one element to describe the length of the next data sequence, followed by that many data items, with a zero length terminating the array.
In other words, if you have data "strings" 1, 2 3, 4 5 6, and 7 8 9 10, you can pack them into an array of 1+1+1+2+1+3+1+4+1 = 15 bytes as 1 1 2 2 3 3 4 5 6 4 7 8 9 10 0.
The functions to access said sequences are quite simple, too. In OP's case, each data item is an uint8:
uint8 dataset[] = { ..., 0 };
To loop over each set, you use two variables: one for the offset of current set, and another for the length:
uint16 offset = 0;
while (1) {
const uint8 length = dataset[offset];
if (!length) {
offset = 0;
break;
} else
++offset;
/* You have 'length' uint8's at dataset+offset. */
/* Skip to next set. */
offset += length;
}
To find a specific dataset, you do need to find it using a loop. For example:
uint8 *find_dataset(const uint16 index)
{
uint16 offset = 0;
uint16 count = 0;
while (1) {
const uint8 length = dataset[offset];
if (length == 0)
return NULL;
else
if (count == index)
return dataset + offset;
offset += 1 + length;
count++;
}
}
The above function will return a pointer to the length item of the index'th set (0 referring to the first set, 1 to the second set, and so on), or NULL if there is no such set.
It is not difficult to write functions to remove, append, prepend, and insert new sets. (When prepending and inserting, you do need to copy the rest of the elements in the dataset array forward (to higher indexes), by 1+length elements, first; this means that you cannot access the array in an interrupt context or from a second core, while the array is being modified.)
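As a rough sketch of the simplest of those operations, appending a set to the packed array could look like this. It uses uint8_t from <stdint.h> instead of the uint8 typedef, and the capacity and the name dataset_append are assumptions for the example:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATASET_CAPACITY 32u

/* Packed sets: length byte, then that many items; a zero length ends the list. */
static uint8_t dataset[DATASET_CAPACITY] = { 0 };

/* Appends one set. Returns 0 on success, -1 if the set is empty
   or does not fit into the remaining space. */
int dataset_append(const uint8_t *items, uint8_t length)
{
    size_t offset = 0;

    if (length == 0)
        return -1;                       /* zero is reserved as terminator */

    while (dataset[offset] != 0)         /* walk to the terminator */
        offset += 1u + dataset[offset];

    /* Need room for the length byte, the items and the new terminator. */
    if (offset + 1u + length + 1u > DATASET_CAPACITY)
        return -1;

    dataset[offset] = length;
    memcpy(&dataset[offset + 1u], items, length);
    dataset[offset + 1u + length] = 0;
    return 0;
}
```

Appending never moves existing sets, so pointers handed out earlier stay valid; the same is not true for prepending or inserting.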
If the data is immutable (for example, generated whenever a new firmware is uploaded to the microcontroller), and you have sufficient flash/rom available, you can use a separate array for each set, an array of pointers to each set, and an array of sizes of each set:
static const uint8 dataset_0[] PROGMEM = { 1 };
static const uint8 dataset_1[] PROGMEM = { 2, 3 };
static const uint8 dataset_2[] PROGMEM = { 4, 5, 6 };
static const uint8 dataset_3[] PROGMEM = { 7, 8, 9, 10 };
#define DATASETS 4
static const uint8 *const dataset_ptr[DATASETS] PROGMEM = {
dataset_0,
dataset_1,
dataset_2,
dataset_3,
};
static const uint8 dataset_len[DATASETS] PROGMEM = {
sizeof dataset_0,
sizeof dataset_1,
sizeof dataset_2,
sizeof dataset_3,
};
When this data is generated at firmware compile time, it is common to put this into a separate header file, and simply include it from the main firmware .c source file (or, if the firmware is very complicated, from the specific .c source file that accesses the data sets). If the above is dataset.h, then the source file typically contains say
#include "dataset.h"
const uint8 dataset_length(const uint16 index)
{
return (index < DATASETS) ? dataset_len[index] : 0;
}
const uint8 *dataset_pointer_P(const uint16 index)
{
return (index < DATASETS) ? dataset_ptr[index] : NULL;
}
i.e., it includes the dataset, and then defines the functions that access the data. (Note that I deliberately made the data itself static, so it is only visible in the current compilation unit; but dataset_length() and dataset_pointer_P(), the safe accessor functions, are accessible from other compilation units (C source files), too.)
When the build is controlled via a Makefile, this is trivial. Let's say the generated header file is dataset.h, and you have a shell script, say generate-dataset.sh, that generates the contents for that header. Then, the Makefile recipe is simply
dataset.h: generate-dataset.sh
@$(RM) $@
$(SHELL) -c "$^ > $@"
with the recipes for the compilation of the C source files that need it, containing it as a prerequisite:
main.o: main.c dataset.h
$(CC) $(CFLAGS) -c main.c
Do note that the indentation in Makefiles always uses Tabs, but this forum does not reproduce them in code snippets. (You can always run sed -e 's|^ *|\t|g' -i Makefile to fix copy-pasted Makefiles, though.)
OP mentioned that they are using Codevision, that does not use Makefiles (but a menu-driven configuration system). If Codevision does not provide a pre-build hook (to run an executable or script before compiling the source files), then OP can write a script or program run on the host machine, perhaps named pre-build, that regenerates all generated header files, and run it by hand before every build.
In the hybrid case, where you know the length of each data set at compile time, and it is immutable (constant), but the sets themselves vary at run time, you need to use a helper script to generate a rather large C header (or source) file. (It will have 1500 lines or more, and nobody should have to maintain that by hand.)
The idea is that you first declare each data set, but do not initialize them. This makes the C compiler reserve RAM for each:
static uint8 dataset_0_0[3];
static uint8 dataset_0_1[2];
static uint8 dataset_0_2[9];
static uint8 dataset_0_3[4];
/* : : */
static uint8 dataset_0_97[1];
static uint8 dataset_0_98[5];
static uint8 dataset_0_99[7];
static uint8 dataset_1_0[6];
static uint8 dataset_1_1[8];
/* : : */
static uint8 dataset_1_98[2];
static uint8 dataset_1_99[3];
static uint8 dataset_2_0[5];
/* : : : */
static uint8 dataset_4_99[9];
Next, declare an array that specifies the length of each set. Make this constant and PROGMEM, since it is immutable and goes into flash/rom:
static const uint8 dataset_len[5][100] PROGMEM = {
sizeof dataset_0_0, sizeof dataset_0_1, sizeof dataset_0_2,
/* ... */
sizeof dataset_4_97, sizeof dataset_4_98, sizeof dataset_4_99
};
Instead of the sizeof statements, you can also have your script output the lengths of each set as a decimal value.
Finally, create an array of pointers to the datasets. This array itself will be immutable (const and PROGMEM), but the targets, the datasets defined first above, are mutable:
static uint8 *const dataset_ptr[5][100] PROGMEM = {
dataset_0_0, dataset_0_1, dataset_0_2, dataset_0_3,
/* ... */
dataset_4_96, dataset_4_97, dataset_4_98, dataset_4_99
};
On AT90CAN128, the flash memory is at addresses 0x0 .. 0x1FFFF (131072 bytes total). Internal SRAM is at addresses 0x0100 .. 0x10FF (4096 bytes total). Like other AVRs, it uses Harvard architecture, where code resides in a separate address space -- in Flash. It has separate instructions for reading bytes from flash (LPM, ELPM).
Because a 16-bit pointer can only reach half the flash, it is rather important that the dataset_len and dataset_ptr arrays are "near", in the lower 64k. Your compiler should take care of this, though.
To generate correct code for accessing the arrays from flash (progmem), at least AVR-GCC needs some helper code:
#include <avr/pgmspace.h>
uint8 subset_len(const uint8 group, const uint8 set)
{
return pgm_read_byte_near(&(dataset_len[group][set]));
}
uint8 *subset_ptr(const uint8 group, const uint8 set)
{
return (uint8 *)pgm_read_word_near(&(dataset_ptr[group][set]));
}
The assembly code, annotated with the cycle counts, avr-gcc-4.9.2 generates for at90can128 from above, is
subset_len:
ldi r25, 0 ; 1 cycle
movw r30, r24 ; 1 cycle
lsl r30 ; 1 cycle
rol r31 ; 1 cycle
add r30, r24 ; 1 cycle
adc r31, r25 ; 1 cycle
add r30, r22 ; 1 cycle
adc r31, __zero_reg__ ; 1 cycle
subi r30, lo8(-(dataset_len)) ; 1 cycle
sbci r31, hi8(-(dataset_len)) ; 1 cycle
lpm r24, Z ; 3 cycles
ret
subset_ptr:
ldi r25, 0 ; 1 cycle
movw r30, r24 ; 1 cycle
lsl r30 ; 1 cycle
rol r31 ; 1 cycle
add r30, r24 ; 1 cycle
adc r31, r25 ; 1 cycle
add r30, r22 ; 1 cycle
adc r31, __zero_reg__ ; 1 cycle
lsl r30 ; 1 cycle
rol r31 ; 1 cycle
subi r30, lo8(-(dataset_ptr)) ; 1 cycle
sbci r31, hi8(-(dataset_ptr)) ; 1 cycle
lpm r24, Z+ ; 3 cycles
lpm r25, Z ; 3 cycles
ret
Of course, declaring subset_len and subset_ptr as static inline would indicate to the compiler you want them inlined, which increases the code size a bit, but might shave off a couple of cycles per invocation.
Note that I have verified the above (except using unsigned char instead of uint8) for at90can128 using avr-gcc 4.9.2.
First, you should put the predefined length array in flash using PROGMEM, if you haven't already.
You could write a script, using the predefined length array as input, to generate a .c (or cpp) file that contains the PROGMEM array definition. Here is an example in python:
# Assume the array that defines the data length is in a file named DataLengthArray.c
# and the array is of the format
# const uint16 dataLengthArray[] PROGMEM = {
# 2, 4, 5, 1, 2,
# 4 ... };
START_OF_ARRAY = "const uint16 dataLengthArray[] PROGMEM = {"
with open("DataLengthArray.c") as f:
    fc = f.read().replace('\n', '')
# Keep only the text between the opening and the closing brace.
dataLengthArray = fc[fc.find(START_OF_ARRAY) + len(START_OF_ARRAY):]
dataLengthArray = dataLengthArray[:dataLengthArray.find("}")]
# Skip empty fields so that a trailing comma does not break int().
lengths = [int(s) for s in dataLengthArray.split(",") if s.strip()]
with open('PointerArray.c', 'w') as outFile:
    outFile.write("extern uint8 array[2000];\n")
    outFile.write("uint8* pointer_array[] PROGMEM = {\n")
    total = 0  # running sum of the lengths seen so far
    for length in lengths:
        outFile.write("array + {}, ".format(total))
        total = total + length
    outFile.write("};")
Which would output PointerArray.c:
extern uint8 array[2000];
uint8* pointer_array[] PROGMEM = {
array + 0, array + 2, array + 6, array + 11, array + 12, array + 14, };
You could run the script as a Pre-build event, if your IDE supports it. Otherwise you will have to remember to run the script every time you update the offsets.
You mention that the data set lengths are pre-defined, but not how they are defined, so I'm going to assume that the way the lengths are written into the code is up for grabs.
If you define your flash array in terms of offsets instead of lengths, you should immediately get a run-time benefit.
With lengths in flash, I expect you have something like this:
const uint8_t lengths[] = {1, 5, 9, ...};
uint8_t get_data_set_length(uint16_t index)
{
return lengths[index];
}
uint8_t * get_data_set_pointer(uint16_t index)
{
uint16_t offset = 0;
uint16_t i = 0;
for ( i = 0; i < index; ++i )
{
offset += lengths[i];
}
return &(array[offset]);
}
With offsets in flash, the const array goes from uint8_t to uint16_t, which doubles the flash usage, plus one additional element to speed up calculating the length of the last data set.
const uint16_t offsets[] = {0, 1, 6, 15, ..., /* last offset + last length */ };
uint8_t get_data_set_length(uint16_t index)
{
return offsets[index+1] - offsets[index];
}
uint8_t * get_data_set_pointer(uint16_t index)
{
uint16_t offset = offsets[index];
return &(array[offset]);
}
If you can't afford that extra flash memory, you could also combine the two by keeping the lengths for all elements and the offsets for only a fraction of the indices, e.g. every 16th element in the example below, trading run-time cost against flash memory cost.
uint8_t get_data_set_length(uint16_t index)
{
return lengths[index];
}
uint8_t * get_data_set_pointer(uint16_t index)
{
    uint16_t i;
    /* Start from the stored offset of the nearest multiple of 16 below index. */
    uint16_t offset = offsets[index / 16];
    for ( i = index & 0xFFF0u; i < index; ++i )
    {
        offset += lengths[i];
    }
    return &(array[offset]);
}
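The hybrid scheme above can be sketched end to end as follows. This is only an illustration: the table contents, the STRIDE of 4 (instead of 16, to keep the tables short) and the names sparse_offsets and get_data_set_offset are assumptions, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one stored offset per STRIDE data sets (the text uses 16). */
#define STRIDE 4u
#define NUM_DATA_SETS 10u

static uint8_t array[64];

/* Hypothetical lengths of the ten data sets. */
static const uint8_t lengths[NUM_DATA_SETS] = {1, 5, 9, 2, 3, 4, 1, 1, 6, 2};

/* Precomputed offsets of data sets 0, 4 and 8 only. */
static const uint16_t sparse_offsets[] = {0, 17, 26};

uint16_t get_data_set_offset(uint16_t index)
{
    uint16_t offset;
    uint16_t i;

    assert(index < NUM_DATA_SETS);
    /* Start from the nearest stored offset at or below index... */
    offset = sparse_offsets[index / STRIDE];
    /* ...and accumulate at most STRIDE - 1 lengths to reach index. */
    for (i = (uint16_t)(index - index % STRIDE); i < index; ++i) {
        offset = (uint16_t)(offset + lengths[i]);
    }
    return offset;
}

uint8_t *get_data_set_pointer(uint16_t index)
{
    return &array[get_data_set_offset(index)];
}
```

With these tables, get_data_set_offset(7) starts from the stored offset 17 and adds three lengths to reach 25, so the loop never runs more than STRIDE - 1 times.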
To simplify the encoding, you can consider using x-macros, e.g.
#define DATA_SET_X_MACRO(data_set_expansion) \
data_set_expansion( A, 1 ) \
data_set_expansion( B, 5 ) \
data_set_expansion( C, 9 )
#include <stddef.h> /* offsetof */
#include <stdint.h> /* uint8_t, uint16_t */
uint8_t array[2000];
#define count_struct(tag,len) uint8_t tag;
#define offset_struct(tag,len) uint8_t tag[len];
#define offset_array(tag,len) (uint16_t)(offsetof(data_set_offset_struct,tag)),
#define length_array(tag,len) len,
#define pointer_array(tag,len) (&(array[offsetof(data_set_offset_struct,tag)])),
typedef struct
{
DATA_SET_X_MACRO(count_struct)
} data_set_count_struct;
typedef struct
{
DATA_SET_X_MACRO(offset_struct)
} data_set_offset_struct;
const uint16_t offsets[] =
{
DATA_SET_X_MACRO(offset_array)
};
const uint16_t lengths[] =
{
DATA_SET_X_MACRO(length_array)
};
uint8_t * const pointers[] =
{
DATA_SET_X_MACRO(pointer_array)
};
The preprocessor turns that into:
typedef struct
{
uint8_t A;
uint8_t B;
uint8_t C;
} data_set_count_struct;
typedef struct
{
uint8_t A[1];
uint8_t B[5];
uint8_t C[9];
} data_set_offset_struct;
const uint16_t offsets[] = { 0,1,6, };
const uint16_t lengths[] = { 1,5,9, };
uint8_t * const pointers[] =
{
array+0,
array+1,
array+6,
};
This just shows an example of what the x-macro can expand to. A short main() can show these in action:
#include <stdio.h>
int main(void)
{
    int i;
    printf("There are %d individual data sets\n", (int)sizeof(data_set_count_struct) );
    printf("The total size of the data sets is %d\n", (int)sizeof(data_set_offset_struct) );
    /* Pointers are converted explicitly; passing a pointer to %x directly is undefined. */
    printf("The data array base address is %x\n", (unsigned)(uintptr_t)array );
    for ( i = 0; i < (int)sizeof(data_set_count_struct); ++i )
    {
        printf( "elem %d: %d bytes at offset %d, or address %x\n", i, lengths[i], offsets[i], (unsigned)(uintptr_t)pointers[i]);
    }
    return 0;
}
With sample output
There are 3 individual data sets
The total size of the data sets is 15
The data array base address is 601060
elem 0: 1 bytes at offset 0, or address 601060
elem 1: 5 bytes at offset 1, or address 601061
elem 2: 9 bytes at offset 6, or address 601066
The above requires you to give a 'tag' (a valid C identifier) for each data set, but if you have 500 of these, pairing each length with a descriptor is probably not a bad thing. With that amount of data, I would also recommend using an include file for the x-macro, rather than a #define, in particular if the data set definitions can be exported from somewhere else.
The benefit of this approach is that you have the data sets defined in one place, and everything is generated from this one definition. If you re-order the definition, or add to it, the arrays will be regenerated at compile time. It also uses only the compiler toolchain, in particular the pre-processor, so there's no need to write external scripts or hook in pre-build steps.
You said that you want to store the address of each data set, but it seems like it would be much simpler to store the offset of each data set. Storing the offsets instead of the addresses means that you don't need to know the address of the big array at compile time.
Right now you have an array of constants containing the length of each data set.
const uint8_t data_set_lengths[] = { 1, 5, 9...};
Just change that to be an array of constants containing the offset of each data set in the big array.
const uint16_t data_set_offsets[] = { 0, 1, 6, 15, ...}; /* uint16_t: offsets into a 2000-byte array do not fit in uint8_t */
You should be able to calculate these offsets at design time given that you already know the lengths. You said yourself, just accumulate the lengths to get the offsets.
With the offsets precalculated, the code won't suffer the cost of accumulating lengths at run time. You can find the address of any data set at run time simply by adding the data set's offset to the address of the big array, and the address of the big array doesn't need to be settled until link time.
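As a sketch of the design-time step, a small host-side helper can accumulate the lengths into an offsets table, which is then pasted (or generated) into the firmware source. The function name compute_offsets and the example tables are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example data: three data sets of 1, 5 and 9 bytes. */
static const uint8_t example_lengths[] = {1, 5, 9};
static uint16_t example_offsets[4]; /* one extra slot for the end offset */

/*
 * Writes count + 1 offsets: offsets[i] is where data set i starts and
 * offsets[count] is the total size. Returns the total size.
 */
uint16_t compute_offsets(const uint8_t *lengths, size_t count, uint16_t *offsets)
{
    uint16_t running = 0;
    size_t i;

    assert(lengths != NULL && offsets != NULL);
    for (i = 0; i < count; ++i) {
        offsets[i] = running;
        running = (uint16_t)(running + lengths[i]);
    }
    offsets[count] = running;
    return running;
}
```

Calling compute_offsets(example_lengths, 3, example_offsets) fills the table with 0, 1, 6, 15. At run time the address of data set i is then just the base of the big array plus offsets[i].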
I dump my RAM (a piece of it, the code segment only) in order to find where each C function is placed. I have no map file and I don't know exactly what the boot/init routines do.
I load my program into RAM, but when I dump the RAM it is very hard to find exactly where each function is. I'd like to build different patterns into the C source so that I can recognize them in the memory dump.
I've tried to start every function with a different first variable containing the name of the function, like:
char this_function_name[]="main";
but it doesn't work, because the string is placed in the data segment.
I have a simple 16-bit RISC CPU and an experimental proprietary compiler (not GCC or any other well-known one). The system has 16Mb of RAM, shared with other applications (bootloader, downloader). It is almost impossible to find, say, a unique sequence of N NOPs or something like 0xABCD. I would like to find all functions in RAM, so I need unique identifiers for the functions that are visible in the RAM dump.
What would be the best pattern for the code segment?
If it were me, I'd use the symbol table, e.g. "nm a.out | grep main". Get the real address of any function you want.
If you really have no symbol table, make your own.
struct tab {
void *addr;
char name[100]; // For ease of searching, use an array.
} symtab[] = {
{ (void*)main, "main" },
{ (void*)otherfunc, "otherfunc" },
};
Search for the name, and the address will immediately precede it. Go to that address. ;-)
If your compiler has inline asm you can use it to create a pattern. Write some NOP instructions which you can easily recognize by their opcodes in the memory dump:
MOV r0,r0
MOV r0,r0
MOV r0,r0
MOV r0,r0
How about a completely different approach to your real problem, which is finding a particular block of code: Use diff.
Compile the code once with the function in question included, and once with it commented out. Produce RAM dumps of both. Then, diff the two dumps to see what's changed -- and that will be the new code block. (You may have to do some sort of processing of the dumps to remove memory addresses in order to get a clean diff, but the order of instructions ought to be the same in either case.)
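A rough sketch of the comparison step, assuming both dumps have already been read into memory buffers of equal length. The helper name first_difference is made up for this example, and it deliberately ignores the address clean-up mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the index of the first byte at which the two dumps differ,
 * or len if the first len bytes are identical.
 */
size_t first_difference(const uint8_t *dump_a, const uint8_t *dump_b, size_t len)
{
    size_t i;

    assert(dump_a != NULL && dump_b != NULL);
    for (i = 0; i < len; ++i) {
        if (dump_a[i] != dump_b[i]) {
            return i;
        }
    }
    return len;
}
```

The returned index is only a starting point for inspection: code placed after the added function shifts, so most bytes beyond the first difference will usually differ as well.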
Numeric constants are placed in the code segment, encoded in the function's instructions. So you could try to use magic numbers like 0xDEADBEEF and so on.
E.g. here's the disassembly view of a simple C function with Visual C++:
void foo(void)
{
00411380 push ebp
00411381 mov ebp,esp
00411383 sub esp,0CCh
00411389 push ebx
0041138A push esi
0041138B push edi
0041138C lea edi,[ebp-0CCh]
00411392 mov ecx,33h
00411397 mov eax,0CCCCCCCCh
0041139C rep stos dword ptr es:[edi]
unsigned id = 0xDEADBEEF;
0041139E mov dword ptr [id],0DEADBEEFh
You can see the 0xDEADBEEF making it into the function's machine code. Note that what you actually see in the executable depends on the endianness of the CPU (tx. Richard).
This is an x86 example. But RISC CPUs (MIPS, etc) have instructions moving immediates into registers - these immediates can have special recognizable values as well (although only 16-bit for MIPS, IIRC).
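To locate such a marker in a dump, something along these lines could be used. It is a sketch only: find_magic32 is a hypothetical helper, and it assumes the 32-bit constant is stored contiguously, which may not hold on CPUs that load immediates 16 bits at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns the offset of the first occurrence of magic in the dump,
 * trying both little-endian and big-endian byte order, or len if
 * the value does not occur.
 */
size_t find_magic32(const uint8_t *dump, size_t len, uint32_t magic)
{
    uint8_t le[4];
    uint8_t be[4];
    size_t i;

    assert(dump != NULL);
    /* Build the two possible byte sequences for the constant. */
    for (i = 0; i < 4; ++i) {
        le[i] = (uint8_t)(magic >> (8 * i));
        be[3 - i] = le[i];
    }
    /* Slide a 4-byte window over the dump. */
    for (i = 0; i + 4 <= len; ++i) {
        if (memcmp(dump + i, le, 4) == 0 || memcmp(dump + i, be, 4) == 0) {
            return i;
        }
    }
    return len;
}
```

Checking both byte orders avoids having to know the target's endianness up front.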
Psihodelia - it's getting harder and harder to catch your intention. Is it just a single function you want to find? Then can't you just place 5 NOPs one after another and look for them? Do you control the compiler/assembler/linker/loader? What tools are at your disposal?
As you noted, this:
char this_function_name[]="main";
... will typically end up with the string itself stored in the data segment and only copied from there into the array on your stack. However, this:
char this_function_name[]= { 'm', 'a', 'i', 'n' };
... will likely put all these bytes in your stack so you will be able to recognize the string in your code (I just tried it on my platform).
Hope this helps
Why not get each function to dump its own address. Something like this:
void* fnaddr( const char* fname, void* addr )
{
    printf( "%s\t0x%p\n", fname, addr ) ;
    return addr ;
}
void test( void )
{
    /* The initializer runs only once, the first time test() is entered. */
    static void* fnaddr_dummy = fnaddr( __FUNCTION__, (void*)test ) ;
}
int main (int argc, const char * argv[])
{
    static void* fnaddr_dummy = fnaddr( __FUNCTION__, (void*)main ) ;
    test() ;
    test() ;
}
By making fnaddr_dummy static, the dump is done once per function. Note that initializing a static local variable with a function call is valid in C++ but not in C, so this needs a C++ compiler. Obviously you would need to adapt fnaddr() to support whatever output or logging means you have on your system. Unfortunately, if the system performs lazy initialisation, you'll only get the addresses of the functions that are actually called (which may be good enough).
You could start each function with a call to the same dummy function like:
void identifyFunction( unsigned int identifier)
{
}
Each of your functions would call identifyFunction with a different parameter (1, 2, 3, ...). This will not give you a magic map file, but when you inspect the code dump you should be able to quickly find out where identifyFunction is, because there will be lots of jumps to that address. Next, scan for those jumps and check the instructions just before each jump to see what parameter is passed. Then you can make your own map file. With some scripting this should be fairly automatic.