variable length array declaration not allowed in OpenCL - why? - arrays

I want to create a local array inside my OpenCL kernel, whose size depends on a parameter of the kernel. It seems that's not allowed - at least with AMD APP.
Is your experience different? Perhaps it's just the APP? Or is is there some rationale here?
Edit: I would now suggest variable length arrays should be allowed in CPU-side code too, and it was an unfortunate call by the C standard committee; but the question stands.

You can dynamically allocate the size of a local block. You need to take it as a parameter to your kernel, and define its size when you call clSetKernelArg.
definition example:
__kernel void kernelName(__local float* myLocalFloats, ...)
host code:
clSetKernelArg(kernel, 0, myLocalFloatCount * sizeof(float), NULL); // <-- set the size to the correct number of bytes to allocate, but use NULL for the data.
Make sure you know what the limit for local memory is on your device before you do this. Call clGetDeviceInfo, and poll for the 'CL_DEVICE_LOCAL_MEM_SIZE' value.

Not sure why people are saying you can't do this as it is something many people do with OpenCL (Yes, I understand it's not exactly the same but it works well enough for many cases).
Since OpenCL kernels are compiled at runtime and, just like text, you can just simply set the size to whatever size you want and then recompile your kernel. This obviously won't be perfect in cases where you have huge variability in sizes but usually I compile several different sizes at startup and then just call the correct one as needed (in your case based on the kernel argument). If I get a new size I don't have a kernel for I will compile it right then and cache the kernel in case it comes up again.

Related

OpenCL - clCreateBuffer size error. Possible work arounds?

After investigating the reason why my program was crashing, I found that I was hitting the maximum for a buffer size, which is 512Mb for me (CL_DEVICE_MAX_MEM_ALLOC_SIZE).
In my case, here are the parameters.
P = 146 (interpolation factor)
num_items = 918144 (number of samples)
sizeof(float) -> 4
So my clCreateBuffer looks something like this:
output = clCreateBuffer(
context,
CL_MEM_READ_ONLY,
num_items * P * sizeof(float),
NULL,
&status);
When the above is multiplied together and divided by (1024x1024), you get around 511Mb which is under the threshold. Change any of the parameters to one higher now and it crashes because it will exceed that 512 value.
My questions is, how can I implement the code in a way where I can use block sizes to do my calculations instead of storing everything in memory and passing that massive chunk of data to the kernel? In reality, the number of samples I have could easily vary to over 5 million and I definitely will not have enough memory to store all those values.
I'm just not sure how to pass small sets of values into my kernel as I have three steps that the values go though before getting an output.
First is an interpolation kernel, then the values go to a lowpass filter kernel and then to a kernel that does decimation. After that the values are written to an output array. If further details of the program are needed for the sake of the problem I can add more.
UPDATE
Not sure what the expected answer is here, if anyone has a reason I would love to hear it and potentially accept it as the valid answer. I don't work with OpenCL anymore so i don't have the setup to verify.
Looking at the OpenCL specification and clCreateBuffer I would say the solution here is allowing use of host memory by adding CL_MEM_USE_HOST_PTR to flags (or whatever suits your use case). Paragraphs from CL_MEM_USE_HOST_PTR:
This flag is valid only if host_ptr is not NULL. If specified, it
indicates that the application wants the OpenCL implementation to use
memory referenced by host_ptr as the storage bits for the memory
object.
The contents of the memory pointed to by host_ptr at the time
of the clCreateBuffer call define the initial contents of the buffer
object.
OpenCL implementations are allowed to cache the buffer
contents pointed to by host_ptr in device memory. This cached copy can
be used when kernels are executed on a device.
What this means is the driver will pass memory between host and device in the most efficient way it can. Basically what you propose yourself in comments, except it is already built into the driver, activated with a single flag, and probably more efficient than anything you can come up with.

Maximum stack size needed for a C program on MSP430

In a C program that doesn't use recursion, it should be possible in theory to work out the maximum/worst case stack size needed to call a given function, and anything that it calls. Are there any free, open source tools that can do this, either from the source code or compiled ELF files?
Alternatively, is there a way to extract a function's stack frame size from an ELF file, so I can try to work it out manually?
I'm compiling for the MSP430 using MSPGCC 3.2.3 (I know it's an old version, but I have to use it in this case). The stack space to allocate is set in the source code, and should be as small as possible so that the rest of memory can be used for other things. I have read that you need to take account of the stack space used by interrupts, but the system I'm using already takes account of this - I'm trying to work out how much extra space to add on top of that. Also, I've read that function pointers make this difficult. In the few places where function pointers are used here, I know which functions they can call, so could take account of these cases manually if the stack space needed for the called functions and the calling functions was known.
Static analysis seems like a more robust option than stack painting at runtime, but working it out at runtime is an option if there's no good way to do it statically.
Edit:
I found GCC's -fstack-usage flag, which saves the frame size for each function as it is compiled. Unfortunately, MSPGCC doesn't support it. But it could be useful for anyone who is trying to do something similar on a different platform.
While static analysis is the best method for determining maximum stack usage you may have to resort to an experimental method. This method cannot guarantee you an absolute maximum but can provide you with a very good idea of your stack usage.
You can check your linker script to get the location of __STACK_END and __STACK_SIZE. You can use these to fill the stack space with an easily recognizable pattern like 0xDEAD or 0xAA55. Run your code through a torture test to try and make sure as many interrupts are generated as possible.
After the test you can examine the stack space to see how much of the stack was overwritten.
Interesting question.
I would expect this information to be statically available in the debugging data included in debug builds.
I had a brief look at the DWARF standard, and it does specify two attributes for functions called DW_AT_frame_base and DW_AT_static_link which can be used to "computes the frame
base of the relevant instance of the subroutine
that immediately encloses the subroutine or entry point".
I think that the only to go is by static analysis. You need to account the space for all non-static local variables, which are going to be mostly pointers, but pointers that are going to be stored in the stack anyway, you'll need also to reserve space for the current running address within the caller, as it's going to be stored by the compiler on the stack so control can be return to the caller after your function returns, and also, you need space for all your function parameters.
Based on that, if you have a tool able to count all parameters, auto variables and figure out their size, you should be able to calculate the minimum stack frame size you'll need.
Please note that the compiler could also try to align values on the stack for your particular architecture, what could make the stack space requirements a little bigger that what you'd expect from this calculation.
Some embedded IDE can give info on stack usageduring runtime
I know that IAR eembedded workbench supports it.
Be aware that you need to take in account that interrupts occur asynchronously, so take the biggest stack usage scenario and add interrupt context to it. If nested interrupts are supported like in ARM processors you need to take this in account also.
TinyOS has some work done on stack size analysis. It is described here:
http://tinyos.stanford.edu/tinyos-wiki/index.php/Stack_Analysis
They only support AVR, but say that "MSP430 is not difficult to support but this is not super high priority". In any case, the page provides lots of resources.

Porting user space code to kernel space

I have a big system written mostly in C that was running in user space up till now. Now I need to compile the code as a kernel module. For that, afaik, I should at least rewrite the code and replace functions as malloc, calloc, free, printf with their kernel equivalents, because those are solely user-space functions. The problem is, however, that I don't have the source code to some custom-made libraries used in the system, and those libraries call malloc etc. inside their functions. So, basically, I might need to reimplement the whole library.
Now the question: will it be a really dirty hack, if I'd write my own implementation of malloc as a wrapper around kmalloc, something like this:
void *malloc(size_t size) {
return kmalloc(size, GFP_USER);
}
Then link this implementation to the system code, which will eliminate all the Unknown symbol in module errors.
Actually I thought that this would be a common problem and someone would have already written such a kmalloc wrapper, but I've been googling for a couple of days now and found nothing useful.
EDIT: The reason for doing this is that the system I'm talking about was a realtime application running on VxWorks realtime OS and now we want to port it to be used on Linux RTAI, where the apps mostly run in kernel space. But I guess there is a possibility to have real-time in user space as well, so, I should probably do as Mike suggested and separate the code into kernel and user-space parts and communicate between them with shared memory.
I've never seen this done before. I did have to do something similar at a previous job (in our phones, for power savings reasons, we had to port a portion of code from user space from the kernel) but that's how I did it.. I took a portion of the code and moved it, and a small portion at that.
When I did it I changed the user space calls to kernel calls because of a number of reasons two primary ones:
It was less confusing that way (others looking at the code didn't have to wonder why I was calling "malloc" from the kernel)
malloc and kmalloc don't work exactly the same. What I mean by that is
2a. kmalloc takes a flags parameter, in your example above you hardcoded it. What if you decide later that you want to change it in some places and not others? (assuming you have a number of different places where you get dynamic memory).
2b. kmalloc doesn't give you memory in the same way as malloc. malloc() will give you the number of bytes you pass in as size_t size. kmalloc() on the other hand, is in the kernel and thus is dealing with the physical memory of the system, which is available only in page-sized chunks; thus when you call kmalloc() you are going to get only certain predefined, fixed-size byte arrays. if you're not aware of this, you might ask for just over a particular chunk and thus get much more memory than you need... a direct port of your code won't protect you from that.
2c. The header files have to change too. Obviously you can't include <stdlib.h> in the kernel, so just because you "wrapped" the malloc call, you still have to go around replacing header files.
quick example of my point in 2b above:
void * stuff;
stuff = kmalloc(1,GFP_KERNEL);
printk("I got: %zu bytes of memory\n", ksize(stuff));
kfree(stuff);
To show the actual amount of memory allocated:
[90144.702588] I got: 32 bytes of memory
anyway... technically, how you describe it, should work fine. Both take a size_t and return a void * so it should work; but be aware that the more code you move into the kernel the less deterministic things become, and that malloc()<=>kmalloc() isn't as 1:1 as it seems.
Trying to make my RTAI code compilable in both user and kernel spaces (as well as working with POSIX), I have developed URT which essentially does what you are asking. It's a lightweight abstraction level over real-time systems (and even over the inconsistent user-space vs kernel-space RTAI functions).

Best Practices for PIC18 Stack/Memory Management?

The limited stack size of budget PICs is a problem area and I have adjusted my code to accommodate this reality. I currently adopt a rough paradigm of grouping closely related functions into a module and declaring all variables global static in the module (to reduce the amount of variables stored in the auto psect, and issues of mutability are only relevant in ISRs, which I account for.) I don't do this because it is good practice, but the reality is you have a finite amount of space to allocate all local function vars that exist in an entire project. In the embedded world of 8/16 bit chips, is this an appropriate method, provided I'm sure to take necessary precautions? I also do things like allocate > 256 bytes of RAM for Ethernet (I know it should be 1500 as standard MTU, but we have a custom situation and very limited RAM) buffers and have to access that memory via pointers so I can avoid the semantics of memory banking. Am I doing it wrong? My app works, but I am 100% open to suggestions for improvement. [c]
I know this was asked 4 years ago but it still has not been properly answered. I believe what the OP is asking is is their approach to working around a limitation of the HiTech PICC18 C compiler valid and/or best practice. As mentioned in a later comment the limitation (a rather bad one and not well advertised by Hitech) is "the Hi-Tech compiler only allows up 256 bytes of auto variables". Actually the limitation is worse than that as it is a total of 256 bytes for local variables and parameters. The linker warning when this is exceeded is pretty cryptic too. Provided that functions are on different branches of the call tree then the compiler can overlap the variables to reuse the space. This means that you can effectively have more than 256 bytes. But note that the interrupt handler (or handlers if you use the priority scheme) has it's own call tree that shares the 256 byte local/param block.
Locals
The two solutions to reduce the space required for locals are: make the locals global or make them static. Making them static keeps the scope the same and provided the function is not called from interrupts is safe (rentrancy is not allowed by the compiler anyway). This is probably the preferred option. The drawback is that the compiler can not reuse those variable's locations to reduce overall memory consumption. Moving the variables to global scope allows reuse, but the reuse management must be managed by the programmer. Probably the best balance is to make simple variables static but to make large chunks of memory like string buffers global and carefully reuse them.
Be careful with initialisation.
foo()
{
int myvar = 5;
}
must change to
foo()
{
static int myvar;
myvar = 5;
}
Parameters
If you go around passing large lots of data down the call tree in parameters you will quickly run into the same 256 byte limitation. Your best option here may be to pass a pointer to a globally allocated struct/s of "options".Alternatively you can have global settings variables that are set by the top caller and read by callees down the tree. It really depends on the design of the software which approach is better.
I've struggled with the same issues as the OP and I think the best option in the long run is to move away from using the Hitech compiler. The optimisation decision the compiler writers took to allocate all locals/params in one block is only really appropriate for the very small ram size PICS. For large PICS you will run out of local/param far before you hit the ram size of the device. Then you have to start hacking your code around to fit the compiler which is perverse.
In summary... Yes your approach is valid. But do consider simply making locals static if that is appropriate as, in general, reducing the scope makes your code safer.
Whereas the C18 compiler used some FSRs (pointers) to manage the data stack, it sounds like the new XC8 compiler from Microchip uses a compiled stack, so you should know exactly how much space is taken up by the stack at compile time. You will also know exactly where each stack variable is stored. I read all about this in the XC8 user's guide and it sounds great. That feature should make this question be moot, assuming you are using XC8.
My experience with compilers/linkers for chips with limited memory is that, as long as you don't use recursive functions and inform the compiler about that, then the compiler is very capable of determining the minimal amount of stack-space that is needed.
I have even seen compilers that give each variable with automatic storage a globally fixed address (no stack at all), where several variables got allocated to overlapping memory, as long as their lifetimes did not overlap.
The general advise when doing (speed or space) optimisations is: make measurements to prove that your optimisation actually has a positive effect.
Since you are nearly out of memory, you have to count each byte of RAM. Using local variables (auto) allows to reuse the memory where you need it (local in the function). When you move the variables to global static address space, you give each variable a unique space. That's wast of address space.
The Microchip compiler allows that different variables share the same address. I don't have the docs at hand, but this can be done by pragma.
But what you need is a analysis of RAM requirements. When you see, that the stack cannot hold all variables but the auto variables would reduce the global memory use, you should consider to increase the stack size using startup code and the linker script.
Best practive is to choose a hardware that fits the requirements.
There are microcontrollers around the cost only some dollars more, but save hundereds or thousand of dollars development costs. If this is a hobby development your effort may not count. But in real world you can often find hardware that is designed only with view of hardware costs.
Especially the PIC18 is not the best example for compact code, what also can be a problem with the flash memory.
This migth sound obvious, but try not to use 16 bits variables on 8 bit precessors. 16 bits variables are fine and needed on bigger arquitectures, but in limited (8 bit) architectures a 16 bit aritmetic is a quick way for depleting both RAM and ROM memories in no time.
If you try to increment a 16 bits variable, the compiler would include a 16 bits increment library, that consumes in most cases a lot of space.
Also, try not to divide or multiply, as for some controllers they are software implemented.
Personally, I go alwais for char and when in need of a divide operation, use rotate rigth 'n' times to divide by 2 n times.
hope this helps!
A bit late, but you should also have a closer look at the C18 compiler user guide (if you were using this compiler).
You could decrease the stack dramatically by statically allocating local variables (overriding the auto keyword). Even better, you can use the overlay storage identifier, which allows different non-overlapping lifetimes variables to be placed at the same address, minimizing RAM. (C18 compiler must operate in Non-Extended mode).

C memcpy() a function

Is there any method to calculate size of a function? I have a pointer to a function and I have to copy entire function using memcpy. I have to malloc some space and know 3rd parameter of memcpy - size. I know that sizeof(function) doesn't work. Do you have any suggestions?
Functions are not first class objects in C. Which means they can't be passed to another function, they can't be returned from a function, and they can't be copied into another part of memory.
A function pointer though can satisfy all of this, and is a first class object. A function pointer is just a memory address and it usually has the same size as any other pointer on your machine.
It doesn't directly answer your question, but you should not implement call-backs from kernel code to user-space.
Injecting code into kernel-space is not a great work-around either.
It's better to represent the user/kernel barrier like a inter-process barrier. Pass data, not code, back and forth between a well defined protocol through a char device. If you really need to pass code, just wrap it up in a kernel module. You can then dynamically load/unload it, just like a .so-based plugin system.
On a side note, at first I misread that you did want to pass memcpy() to the kernel. You have to remind that it is a very special function. It is defined in the C standard, quite simple, and of a quite broad scope, so it is a perfect target to be provided as a built-in by the compiler.
Just like strlen(), strcmp() and others in GCC.
That said, the fact that is a built-in does not impede you ability to take a pointer to it.
Even if there was a way to get the sizeof() a function, it may still fail when you try to call a version that has been copied to another area in memory. What if the compiler has local or long jumps to specific memory locations. You can't just move a function in memory and expect it to run. The OS can do that but it has all the information it takes to do it.
I was going to ask how operating systems do this but, now that I think of it, when the OS moves stuff around it usually moves a whole page and handles memory such that addresses translate to a page/offset. I'm not sure even the OS ever moves a single function around in memory.
Even in the case of the OS moving a function around in memory, the function itself must be declared or otherwise compiled/assembled to permit such action, usually through a pragma that indicates the code is relocatable. All the memory references need to be relative to its own stack frame (aka local variables) or include some sort of segment+offset structure such that the CPU, either directly or at the behest of the OS, can pick the appropriate segment value. If there was a linker involved in creating the app, the app may have to be
re-linked to account for the new function address.
There are operating systems which can give each application its own 32-bit address space but it applies to the entire process and any child threads, not to an individual function.
As mentioned elsewhere, you really need a language where functions are first class objects, otherwise you're out of luck.
You want to copy a function? I do not think that this is possible in C generally.
Assume, you have a Harvard-Architecture microcontroller, where code (in other words "functions") is located in ROM. In this case you cannot do that at all.
Also I know several compilers and linkers, which do optimization on file (not only function level). This results in opcode, where parts of C functions are mixed into each other.
The only way which I consider as possible may be:
Generate opcode of your function (e.g. by compiling/assembling it on its own).
Copy that opcode into an C array.
Use a proper function pointer, pointing to that array, to call this function.
Now you can perform all operations, common to typical "data", on that array.
But apart from this: Did you consider a redesign of your software, so that you do not need to copy a functions content?
I don't quite understand what you are trying to accomplish, but assuming you compile with -fPIC and don't have your function do anything fancy, no other function calls, not accessing data from outside function, you might even get away with doing it once. I'd say the safest possibility is to limit the maximum size of supported function to, say, 1 kilobyte and just transfer that, and disregard the trailing junk.
If you really needed to know the exact size of a function, figure out your compiler's epilogue and prologue. This should look something like this on x86:
:your_func_epilogue
mov esp, ebp
pop ebp
ret
:end_of_func
;expect a varying length run of NOPs here
:next_func_prologue
push ebp
mov ebp, esp
Disassemble your compiler's output to check, and take the corresponding assembled sequences to search for. Epilogue alone might be enough, but all of this can bomb if searched sequence pops up too early, e.g. in the data embedded by the function. Searching for the next prologue might also get you into trouble, i think.
Now please ignore everything that i wrote, since you apparently are trying to approach the problem in the wrong and inherently unsafe way. Paint us a larger picture please, WHY are you trying to do that, and see whether we can figure out an entirely different approach.
A similar discussion was done here:
http://www.motherboardpoint.com/getting-code-size-function-c-t95049.html
They propose creating a dummy function after your function-to-be-copied, and then getting the memory pointers to both. But you need to switch off compiler optimizations for it to work.
If you have GCC >= 4.4, you could try switching off the optimizations for your function in particular using #pragma:
http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas
Another proposed solution was not to copy the function at all, but define the function in the place where you would want to copy it to.
Good luck!
If your linker doesn't do global optimizations, then just calculate the difference between the function pointer and the address of the next function.
Note that copying the function will produce something which can't be invoked if your code isn't compiled relocatable (i.e. all addresses in the code must be relative, for example branches; globals work, though since they don't move).
It sounds like you want to have a callback from your kernel driver to userspace, so that it can inform userspace when some asynchronous job has finished.
That might sound sensible, because it's the way a regular userspace library would probably do things - but for the kernel/userspace interface, it's quite wrong. Even if you manage to get your function code copied into the kernel, and even if you make it suitably position-independent, it's still wrong, because the kernel and userspace code execute in fundamentally different contexts. For just one example of the differences that might cause problems, if a page fault happens in kernel context due to a swapped-out page, that'll cause a kernel oops rather than swapping the page in.
The correct approach is for the kernel to make some file descriptor readable when the asynchronous job has finished (in your case, this file descriptor almost certainly be the character device your driver provides). The userspace process can then wait for this event with select / poll, or with read - it can set the file descriptor non-blocking if wants, and basically just use all the standard UNIX tools for dealing with this case. This, after all, is how the asynchronous nature of network sockets (and pretty much every other asychronous case) is handled.
If you need to provide additional information about what the event that occured, that can be made available to the userspace process when it calls read on the readable file descriptor.
Function isn't just object you can copy. What about cross-references / symbols and so on? Of course you can take something like standard linux "binutils" package and torture your binaries but is it what you want?
By the way if you simply are trying to replace memcpy() implementation, look around LD_PRELOAD mechanics.
I can think of a way to accomplish what you want, but I won't tell you because it's a horrific abuse of the language.
A cleaner method than disabling optimizations and relying on the compiler to maintain order of functions is to arrange for that function (or a group of functions that need copying) to be in its own section. This is compiler and linker dependant, and you'll also need to use relative addressing if you call between the functions that are copied. For those asking why you would do this, its a common requirement in embedded systems that need to update the running code.
My suggestion is: don't.
Injecting code into kernel space is such an enormous security hole that most modern OSes forbid self-modifying code altogether.
As near as I can tell, the original poster wants to do something that is implementation-specific, and so not portable; this is going off what the C++ standard says on the subject of casting pointers-to-functions, rather than the C standard, but that should be good enough here.
In some environments, with some compilers, it might be possible to do what the poster seems to want to do (that is, copy a block of memory that is pointed to by the pointer-to-function to some other location, perhaps allocated with malloc, cast that block to a pointer-to-function, and call it directly). But it won't be portable, which may not be an issue. Finding the size required for that block of memory is itself dependent on the environment, and compiler, and may very well require some pretty arcane stuff (e.g., scanning the memory for a return opcode, or running the memory through a disassembler). Again, implementation-specific, and highly non-portable. And again, may not matter for the original poster.
The links to potential solutions all appear to make use of implementation-specific behaviour, and I'm not even sure that they do what the purport to do, but they may be suitable for the OP.
Having beaten this horse to death, I am curious to know why the OP wants to do this. It would be pretty fragile even if it works in the target environment (e.g., could break with changes to compiler options, compiler version, code refactoring, etc). I'm glad that I don't do work where this sort of magic is necessary (assuming that it is)...
I have done this on a Nintendo GBA where I've copied some low level render functions from flash (16 bit access slowish memory) to the high speed workspace ram (32 bit access, at least twice as fast). This was done by taking the address of the function immdiately after the function I wanted to copy, size = (int) (NextFuncPtr - SourceFuncPtr). This did work well but obviously cant be garunteed on all platforms (does not work on Windows for sure).
I think one solution can be as below.
For ex: if you want to know func() size in program a.c, and have indicators before and after the function.
Try writing a perl script which will compile this file into object format(cc -o) make sure that pre-processor statements are not removed. You need them later on to calculate the size from object file.
Now search for your two indicators and find out the code size in between.

Resources