memcpy [or not?] and multithreading [std::thread from c++11] - c

I'm writing software in C/C++ that makes heavy use of BIAS/Profil, an interval arithmetic library. In my algorithm I have a master which divides a domain and feeds parts of it to slave process(es). Those return an int status about those domain parts. There is common data, read-only, and that's it.
I need to parallelize my code; however, as soon as two slave threads (or more, I guess) are running and both call functions of this library, it segfaults. What is peculiar about those segfaults is that gdb rarely indicates the same error line from two runs: it depends on the speed of the threads, whether one started earlier, etc. I've tried having the threads yield until a go-ahead from the master; it 'stabilizes' the error. I'm fairly sure that it comes from the library's calls to memcpy (following the gdb backtrace, I always end up in a BIAS/Profil function calling memcpy. To be fair, almost all its functions call memcpy into a temporary object before returning the result...). From what I read on the web, it would appear that memcpy() may not be thread-safe, depending on the implementation (especially here). (It seems weird for a function supposed to only read the shared data... or maybe when writing the thread-local data both threads go for the same memory space?)
To try to address this, I'd like to 'replace' (at least for tests, to see if the behavior changes) the call to memcpy with a mutex-framed call (something like mtx.lock(); memcpy(...); mtx.unlock();).
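For the test, such a mutex-framed wrapper might look like the sketch below. The wrapper name `locked_memcpy` and the global lock are invented for this illustration; BIAS/Profil itself knows nothing about them.

```c
#include <pthread.h>
#include <string.h>

/* Hypothetical wrapper: serializes every copy through one global mutex.
 * This will be slow, but it is enough to test whether the crashes
 * disappear when copies can no longer run concurrently. */
static pthread_mutex_t g_copy_lock = PTHREAD_MUTEX_INITIALIZER;

void *locked_memcpy(void *dst, const void *src, size_t n)
{
    pthread_mutex_lock(&g_copy_lock);
    void *r = memcpy(dst, src, n);
    pthread_mutex_unlock(&g_copy_lock);
    return r;
}
```

If the segfaults persist even with every copy serialized, the problem is almost certainly elsewhere in the library's shared state, not in memcpy itself.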
1st question: I'm not a dev/code engineer at all, and I lack a lot of basic knowledge. I think that as I use a pre-built BIAS/Profil library, the memcpy called is the one of the system the library was built on; correct? If so, would it change anything were I to try building the library from source on my system? (I'm not sure I can build this library, hence the question.)
2nd question:
in my string.h, memcpy is declared by:
#ifndef __HAVE_ARCH_MEMCPY
extern void * memcpy(void *,const void *,__kernel_size_t);
#endif

and in some other string headers (string_64.h, string_32.h) there is a definition of the form:

#define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))

or some more explicit definition, or just a declaration like the one quoted.
It's starting to get ugly, but ideally I'd like to define a pre-processor variable #define __HAVE_ARCH_MEMCPY 1, and a void * memcpy(void *, const void *, __kernel_size_t) which would do the mutex-framed memcpy using the dismissed memcpy.
The idea here is to avoid messing with the library and make it work with 3 lines of code ;)
Any better idea? (it would make my day...)

IMHO you shouldn't concentrate on the memcpy()s, but on the higher-level functionality.
memcpy() is thread-safe if the memory intervals handled by the parallel running threads don't overlap. In practice, memcpy() is just a for(;;) loop (with a lot of optimizations), at least in glibc; that is why it is declared thread-safe.
If you want to know what your parallel memcpy()-ing threads will do, imagine the for(;;) loops copying memory through long-int pointers.
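That mental model can be written down as a toy version (real implementations copy word-sized chunks and use vector instructions, but the concurrency behavior is the same):

```c
#include <stddef.h>

/* Toy model of memcpy: a plain byte-copy loop. Two threads running this
 * concurrently are safe as long as neither thread's destination range
 * [dst, dst+n) overlaps the other thread's source or destination range. */
void *toy_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```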

Given your observations, that the Profil lib is from the last millennium, and that the documentation (homepage and Profil2.ps) does not even contain the word "thread", I would assume that the lib is not thread-safe.
1st: No, usually memcpy is part of libc which is dynamically linked (at least nowadays). On linux, check with ldd NAMEOFBINARY, which should give a line with something like libc.so.6 => /lib/i386-linux-gnu/libc.so.6 or similar. If not: rebuild. If yes: rebuilding could help anyway, as there are many other factors.
Besides this, I think memcpy is thread-safe as long as you never write back data (even writing back unmodified data will hurt: https://blogs.oracle.com/dave/entry/memcpy_concurrency_curiosities).
2nd: If it turns out that you have to use a modified memcpy, also think about LD_PRELOAD.
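An LD_PRELOAD interposer for memcpy might look like the following sketch. It forwards to the real glibc memcpy found via dlsym(RTLD_NEXT, ...); this is GNU-specific and meant only for experiments, since serializing every memcpy will be slow.

```c
/* Hypothetical LD_PRELOAD shim: intercepts memcpy, serializes it, then
 * forwards to the next memcpy in the lookup chain (normally glibc's).
 * Build: gcc -shared -fPIC -o shim.so shim.c -ldl -lpthread
 * Run:   LD_PRELOAD=./shim.so ./your_program */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *memcpy(void *dst, const void *src, size_t n)
{
    static void *(*real_memcpy)(void *, const void *, size_t);
    if (!real_memcpy)
        real_memcpy = (void *(*)(void *, const void *, size_t))
                      dlsym(RTLD_NEXT, "memcpy");
    pthread_mutex_lock(&lock);
    void *r = real_memcpy(dst, src, n);
    pthread_mutex_unlock(&lock);
    return r;
}
```

Note that compilers often expand small memcpy calls inline as a builtin, so the shim only catches the calls that actually go through the library symbol.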

In general, you must use a critical section, mutex, or some other protection technique to keep multiple threads from calling non-thread-safe (non-re-entrant) functions simultaneously. Some ANSI C implementations of memcpy() are not thread-safe, some are. (safe, not safe)
Writing functions that are thread-safe, and/or writing threaded programs that can safely accommodate non-thread-safe functions, is a substantial topic. Very doable, but it requires reading up on the topic. There is much written. This will at least help you start asking the right questions.

malloc alternative for interrupt safety

Once again, I'm not a developer but a woodworker, so my questions might be, well, stupid.
I forgot something really important: I have to use gcc-3.8 to compile, as the original code I'm working with can't compile with newer versions. I totally forgot to talk about that, sorry.
I'm sending data from a tool to an autonomous robot.
The robot receives data as unsigned char*.
I read a lot, and it seems that malloc isn't interrupt-safe.
As the robot could do bad & dangerous things, I try to make every part safe (at least as much as I can).
This malloc happens after an interrupt raised by received data.
This code segment is now giving me a hard time making it safe; also my syntax is probably bad:
char* _Xp = (char*) malloc((strlen((char*)_X) + 1)*sizeof(char));
strcpy(_Xp, (char*)_X);
1) is malloc really not interrupt-safe? The information I found is from 2004.
2) is there a more efficient way to initialize the buffer?
3) why are unsigned char "bad"? (I read something about that.)
4) last question: is strcpy also not interrupt-safe? The sources I read differ on that point.
=====
Answering some questions:
The robot doesn't have an operating system; the target is an STR911FFM44 at 25 MHz (if it helps).
Input arrays are all nul-terminated.
The code is not in the interrupt handler but in the infinite loop, and it is processed only if the IrHandler has set the flag for it.
I don't know the rate of the data stream, so I can't "hard code" the safety, but interrupts should come every 500 ms to 1500 ms.
1) is malloc really not interrupt safe?
malloc accesses and modifies a global resource: the common memory pool of your running program. If the access happens from two unsynchronized places, such as your normal program flow and the ISR1, then it can mess up the pool. If your ISR doesn't call malloc itself, it won't be a problem.
If it does, you'd need to put in place a system for preventing such reentry into malloc. For instance, wrap the call to malloc in a function that turns interrupt handling off, and then on again.
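Such a wrapper could be sketched as below. The irq_disable/irq_enable hooks are placeholders for whatever masks interrupts on your target (on the ARM9-based STR911 that would mean toggling the CPSR I-bit or the VIC); here they are empty stubs so the sketch compiles on a host.

```c
#include <stdlib.h>

/* Target-specific hooks: on real hardware these would mask/unmask
 * interrupts. Empty host-side stubs for illustration only. */
static void irq_disable(void) { /* e.g. set CPSR I-bit on ARM9 */ }
static void irq_enable(void)  { /* e.g. clear CPSR I-bit on ARM9 */ }

/* Serialize malloc against the ISR by keeping interrupts masked for
 * the duration of the allocation. */
void *malloc_protected(size_t size)
{
    irq_disable();
    void *p = malloc(size);
    irq_enable();
    return p;
}
```

Keep in mind that masking interrupts for the (unbounded) duration of malloc can make you miss tight interrupt deadlines, which is one more argument for the static-buffer approach discussed below.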
2) is there a more efficient way to initialize the buffer?
If you need a buffer with allocated storage duration (i.e. you decide when its lifetime ends, not the scope it's allocated in), then there isn't really a standard C alternative. By the way, sizeof(char) is always 1, so there is no need to multiply by it; and don't forget to allocate one extra byte for the terminating nul that strcpy will write. Since C allows implicit conversion of pointer types from void*, the call can at least be prettified a bit2:
char* _Xp = malloc(strlen((char*)_X) + 1);
3) why are unsigned char "bad"?
They aren't bad. In fact, when you need to know exactly whether or not the character type is signed, you have to use signed char or unsigned char. Plain char can be signed on one platform and unsigned on another.
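A tiny check makes the difference concrete (this is a generic illustration, not taken from the poster's code):

```c
#include <limits.h>

/* Whether plain char is signed is implementation-defined: CHAR_MIN is
 * either 0 or SCHAR_MIN. signed char and unsigned char are unambiguous:
 * (signed char)0xFF is -1 and (unsigned char)0xFF is 255 on the usual
 * 8-bit two's complement platforms. */
int plain_char_is_signed(void)
{
    return (char)-1 < 0;  /* 1 on most x86 ABIs, 0 e.g. on ARM Linux */
}
```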
1 Interrupt Service Routine.
2 C has a notion of reserved identifiers. In particular, any identifier that starts with an underscore followed by an uppercase letter is always reserved. So renaming your variables may help with portability.
First of all, you say that you are using a bare-metal microcontroller, so malloc never makes sense. It is not a PC: you don't share your RAM with anyone else. So all the dangers and disadvantages of malloc don't even enter the discussion, since it makes no sense for you to use malloc at all.
1) is malloc really not interrupt-safe? The information I found is from 2004.
4) last question: is strcpy also not interrupt-safe? The sources I read differ on that point.
No function using resources shared between an ISR and the main application is interrupt-safe. You should avoid calling library functions from an ISR; ISRs should be kept minimal.
All data shared between an ISR and the caller has to be treated with care. You must ensure atomic access to individual objects. You must declare such variables as volatile to prevent the optimizer from caching them. You might have to use semaphores or other synchronization means. This applies to all such data, whether you change it yourself or through a library function.
Failing to do all of the above will lead to very mysterious and subtle bugs, causing data corruption, race conditions, or code that is never executed. Overall, interrupts are always hard to work with because of all this extra complexity. Only use them when your real-time requirements give you no other option.
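A common way to satisfy those rules is to let the ISR touch only a small volatile flag plus a fixed buffer, and do everything else in the main loop. A rough sketch under invented names (not the poster's actual handler):

```c
#include <stdint.h>

#define RX_MAX 64

/* Shared between ISR and main loop: the flag is volatile so the main
 * loop's polling isn't optimized away; writing it last "publishes"
 * the buffer contents. */
static volatile uint8_t rx_ready;   /* set by ISR, cleared by main loop */
static volatile uint8_t rx_len;
static uint8_t rx_buf[RX_MAX];      /* filled by ISR, read by main loop */

/* Would run as (or be called from) the UART interrupt handler. */
void uart_isr(const uint8_t *data, uint8_t len)
{
    if (!rx_ready && len <= RX_MAX) {   /* drop data if buffer is busy */
        for (uint8_t i = 0; i < len; i++)
            rx_buf[i] = data[i];
        rx_len = len;
        rx_ready = 1;                   /* publish last */
    }
}

/* Called from the infinite main loop; returns bytes consumed (0 if none). */
int poll_rx(uint8_t *out)
{
    if (!rx_ready)
        return 0;
    int n = rx_len;
    for (int i = 0; i < n; i++)
        out[i] = rx_buf[i];
    rx_ready = 0;                       /* hand the buffer back to the ISR */
    return n;
}
```

With a 500-1500 ms interrupt period and a bounded message size, a single statically allocated buffer like this removes the need for malloc entirely.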
2) is there a more efficient way to initialize the buffer?
Yes, use an array: static char _Xp[LARGE_ENOUGH_FOR_WORST_CASE]; Usually it is a good idea to keep such buffers in the .bss/.data segment rather than on the stack, hence the static keyword.
3) why are unsigned char "bad"? (read something about that).
There is nothing bad about them as such. The different char types are problematic, though, because in theory they could have sizes other than 8 bits. Worse, char without signed/unsigned has implementation-defined signedness, meaning it might be signed or unsigned depending on the compiler. This means you should never use the char type for storing anything but text strings.
If you need a variable type to hold bytes of data, always use uint8_t from stdint.h.
As the robot could do bad & dangerous things, I try to make every parts safe (at least as much as I can).
Writing safe software for embedded systems is a highly qualified task. I wouldn't recommend that anyone with less than 5 years of experience working full-time with embedded firmware even consider it, unless there's at least one hardened C veteran on your team and all code passes through peer review and static analysis.
It sounds like you would benefit a lot from reading through the MISRA-C:2012 coding guidelines. It is a safe subset of the C language, intended to be used in safety-critical applications, or any form of application where bugs are bad. Unfortunately the MISRA-C document is not free, but it is becoming an industry standard.

xmlCleanupParser() memory loss?

As xmlCleanupParser() from the very good libxml2 is not thread-safe, my question is (and I have no way to check it myself): how much memory (rough number) is lost by xmlParseFile(), and, more importantly, does this memory loss accumulate over many calls to xmlParseFile()?
Despite the fact that malloc() and free() (or whatever memory-handling implementation) are not necessarily thread-safe in C before C11, there's always the problem of shared/global memory. File handles to the same file in different threads aren't that bad as long as they're read-only.
However, starting with libxml2 2.4.7, you might be able to enable thread safety at the API level, for single threads per document: http://www.xmlsoft.org/threads.html
When I look at the sources of libxml2 2.9.1, I'm positive that thread safety is fully implemented: besides global mutexes, there's also an atomic allocation function.
Downloads:
ftp://xmlsoft.org/libxml2/
Following the advice given by meaning-matters, and using the only tool I found under OS/2 (that ancient IBM operating system) to check memory, there seems to be no difference in memory loss between using xmlCleanupParser() or choosing not to (for me).

Porting user space code to kernel space

I have a big system written mostly in C that has been running in user space up till now. Now I need to compile the code as a kernel module. For that, AFAIK, I should at least rewrite the code and replace functions such as malloc, calloc, free, and printf with their kernel equivalents, because those are user-space-only functions. The problem, however, is that I don't have the source code of some custom-made libraries used in the system, and those libraries call malloc etc. inside their functions. So, basically, I might need to reimplement the whole library.
Now the question: will it be a really dirty hack, if I'd write my own implementation of malloc as a wrapper around kmalloc, something like this:
void *malloc(size_t size)
{
    return kmalloc(size, GFP_USER);
}
Then I'd link this implementation with the system code, which would eliminate all the Unknown symbol in module errors.
Actually I thought that this would be a common problem and someone would have already written such a kmalloc wrapper, but I've been googling for a couple of days now and found nothing useful.
EDIT: The reason for doing this is that the system I'm talking about was a realtime application running on VxWorks realtime OS and now we want to port it to be used on Linux RTAI, where the apps mostly run in kernel space. But I guess there is a possibility to have real-time in user space as well, so, I should probably do as Mike suggested and separate the code into kernel and user-space parts and communicate between them with shared memory.
I've never seen this done before. I did have to do something similar at a previous job (in our phones, for power-saving reasons, we had to port a portion of code from user space into the kernel), and here's how I did it: I took a portion of the code and moved it, and a small portion at that.
When I did it, I changed the user-space calls to kernel calls, for a number of reasons; two primary ones:
It was less confusing that way (others looking at the code didn't have to wonder why I was calling "malloc" from the kernel).
malloc and kmalloc don't work exactly the same. What I mean by that is:
2a. kmalloc takes a flags parameter; in your example above you hardcoded it. What if you decide later that you want to change it in some places and not others? (Assuming you have a number of different places where you get dynamic memory.)
2b. kmalloc doesn't give you memory in the same way as malloc. malloc() will give you the number of bytes you pass in as size_t size. kmalloc(), on the other hand, is in the kernel and thus is dealing with the physical memory of the system, which is available only in page-sized chunks; thus when you call kmalloc() you are going to get only certain predefined, fixed-size byte arrays. If you're not aware of this, you might ask for just over a particular chunk size and thus get much more memory than you need; a direct port of your code won't protect you from that.
2c. The header files have to change too. Obviously you can't include <stdlib.h> in the kernel, so just because you "wrapped" the malloc call, you still have to go around replacing header files.
quick example of my point in 2b above:
void * stuff;
stuff = kmalloc(1,GFP_KERNEL);
printk("I got: %zu bytes of memory\n", ksize(stuff));
kfree(stuff);
To show the actual amount of memory allocated:
[90144.702588] I got: 32 bytes of memory
Anyway... technically, as you describe it, it should work fine. Both take a size_t and return a void *, so it should work; but be aware that the more code you move into the kernel, the less deterministic things become, and that malloc() <=> kmalloc() isn't as 1:1 as it seems.
Trying to make my RTAI code compilable in both user and kernel spaces (as well as working with POSIX), I have developed URT which essentially does what you are asking. It's a lightweight abstraction level over real-time systems (and even over the inconsistent user-space vs kernel-space RTAI functions).

Is function call an effective memory barrier for modern platforms?

In a codebase I reviewed, I found the following idiom.
void notify(struct actor_t act) {
    write(act.pipe, "M", 1);
}

// thread A sending data to thread B
void send(byte *data) {
    global.data = data;
    notify(threadB);
}

// in thread B event loop
read(this.sock, &cmd, 1);
switch (cmd) {
case 'M': use_data(global.data); break;
...
}
"Hold it", I said to the author, a senior member of my team, "there's no memory barrier here! You don't guarantee that global.data will be flushed from the cache to main memory. If thread A and thread B will run in two different processors - this scheme might fail".
The senior programmer grinned and explained slowly, as if explaining to his five-year-old boy how to tie his shoelaces: "Listen young boy, we've seen here many thread-related bugs, in high-load testing and in real clients", he paused to scratch his longish beard, "but we've never had a bug with this idiom".
"But, it says in the book..."
"Quiet!", he hushed me promptly, "Maybe theoretically, it's not guaranteed, but in practice, the fact you used a function call is effectively a memory barrier. The compiler will not reorder the instruction global.data = data, since it can't know if anyone using it in the function call, and the x86 architecture will ensure that the other CPUs will see this piece of global data by the time thread B reads the command from the pipe. Rest assured, we have ample real world problems to worry about. We don't need to invest extra effort in bogus theoretical problems.
"Rest assured my boy, in time you'll understand to separate the real problem from the I-need-to-get-a-PhD non-problems."
Is he correct? Is that really a non-issue in practice (say x86, x64 and ARM)?
It's against everything I learned, but he does have a long beard and a really smart look!
Extra points if you can show me a piece of code proving him wrong!
Memory barriers aren't just there to prevent instruction reordering. Even if instructions aren't reordered, there can still be problems with cache coherence. As for the reordering, it depends on your compiler and settings. ICC is particularly aggressive with reordering. MSVC with whole-program optimization can be, too.
If your shared data variable is declared volatile, most compilers will generate a memory barrier around reads and writes of the variable and prevent reordering, even though that's not in the spec. This is not the correct way of using volatile, nor what it was meant for.
(If I had any votes left, I'd +1 your question for the narration.)
In practice, a function call is a compiler barrier, meaning that the compiler will not move global memory accesses past the call. A caveat to this is functions which the compiler knows something about, e.g. builtins, inlined functions (keep in mind IPO!) etc.
So a processor memory barrier (in addition to a compiler barrier) is in theory needed to make this work. However, since you're calling read and write which are syscalls that change the global state, I'm quite sure that the kernel issues memory barriers somewhere in the implementation of those. There is no such guarantee though, so in theory you need the barriers.
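If you'd rather not depend on the syscalls happening to fence for you, the barriers can be spelled out explicitly. This sketch uses the standard C11 fence for the publish side of the idiom, with the GCC/Clang-style compiler-only barrier shown for comparison:

```c
#include <stdatomic.h>

/* Compiler-only barrier (GCC/Clang): stops the compiler from moving
 * memory accesses across it, but emits no CPU instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Release-publish: the fence orders the payload write before the flag
 * write on the hardware as well, so a consumer that observes flag == 1
 * (with a matching acquire fence or acquire load) also sees the slot. */
void publish(int *slot, int value, atomic_int *flag)
{
    *slot = value;                                  /* write payload */
    atomic_thread_fence(memory_order_release);      /* payload before flag */
    atomic_store_explicit(flag, 1, memory_order_relaxed);
}
```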
The basic rule is: the compiler must make the global state appear to be exactly as you coded it, but if it can prove that a given function doesn't use global variables then it can implement the algorithm any way it chooses.
The upshot is that traditional compilers always treated functions in another compilation unit as a memory barrier because they couldn't see inside those functions. Increasingly, modern compilers are growing "whole program" or "link time" optimization strategies which break down these barriers and will cause poorly written code to fail, even though it's been working fine for years.
If the function in question is in a shared library then the compiler won't be able to see inside it, but if the function is one defined by the C standard then it doesn't need to (it already knows what the function does), so you have to be careful of those as well. Note that a compiler will not recognise a kernel call for what it is, but the very act of inserting something that the compiler can't recognise (inline assembler, or a call into an assembler file) will create a memory barrier in itself.
In your case, notify will either be a black box the compiler can't see inside (a library function) or else it will contain a recognisable memory barrier, so you are most likely safe.
In practice, you have to write very bad code to fall over this.
In practice, he's correct and a memory barrier is implied in this specific case.
But the point is that if its presence is "debatable", the code is already too complex and unclear.
Really guys, use a mutex or other proper constructs. It's the only safe way to deal with threads and to write maintainable code.
And maybe you'll see other errors, like the fact that the code is unpredictable if send() is called more than once.
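A mutex-based version of the same handoff might look like this sketch (the names are illustrative, not from the reviewed codebase; a real event loop would likely combine this with the pipe for wakeup):

```c
#include <pthread.h>

typedef unsigned char byte;

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cond = PTHREAD_COND_INITIALIZER;
static byte *g_data;
static int   g_has_data;

/* Thread A: publish the pointer under the lock and wake the consumer.
 * The mutex provides all the ordering the original idiom only hoped for. */
void send(byte *data)
{
    pthread_mutex_lock(&g_lock);
    g_data = data;
    g_has_data = 1;
    pthread_cond_signal(&g_cond);
    pthread_mutex_unlock(&g_lock);
}

/* Thread B: block until data is available, then take it. */
byte *receive(void)
{
    pthread_mutex_lock(&g_lock);
    while (!g_has_data)
        pthread_cond_wait(&g_cond, &g_lock);
    byte *d = g_data;
    g_has_data = 0;
    pthread_mutex_unlock(&g_lock);
    return d;
}
```

Note that this version also makes the double-send problem explicit: a second send() before receive() simply overwrites g_data, and you can decide under the lock whether to queue, drop, or block instead.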

C memcpy() a function

Is there any method to calculate the size of a function? I have a pointer to a function, and I have to copy the entire function using memcpy. I have to malloc some space, and I need to know the 3rd parameter of memcpy: the size. I know that sizeof(function) doesn't work. Do you have any suggestions?
Functions are not first-class objects in C, which means they can't be passed to another function, they can't be returned from a function, and they can't be copied into another part of memory.
A function pointer though can satisfy all of this, and is a first class object. A function pointer is just a memory address and it usually has the same size as any other pointer on your machine.
It doesn't directly answer your question, but you should not implement call-backs from kernel code to user-space.
Injecting code into kernel-space is not a great work-around either.
It's better to treat the user/kernel barrier like an inter-process barrier. Pass data, not code, back and forth via a well-defined protocol through a char device. If you really need to pass code, just wrap it up in a kernel module. You can then dynamically load/unload it, just like a .so-based plugin system.
On a side note, at first I misread that you wanted to pass memcpy() to the kernel. Bear in mind that it is a very special function: it is defined in the C standard, quite simple, and of quite broad use, so it is a perfect target to be provided as a built-in by the compiler.
Just like strlen(), strcmp() and others in GCC.
That said, the fact that it is a built-in does not impede your ability to take a pointer to it.
Even if there were a way to get the sizeof() a function, the copy might still fail when you try to call the version that has been copied to another area of memory. What if the compiler emitted local or long jumps to specific memory locations? You can't just move a function in memory and expect it to run. The OS can do that, but it has all the information it takes to do it.
I was going to ask how operating systems do this but, now that I think of it, when the OS moves stuff around it usually moves a whole page and handles memory such that addresses translate to a page/offset. I'm not sure even the OS ever moves a single function around in memory.
Even in the case of the OS moving a function around in memory, the function itself must be declared or otherwise compiled/assembled to permit such action, usually through a pragma that indicates the code is relocatable. All the memory references need to be relative to its own stack frame (aka local variables) or include some sort of segment+offset structure such that the CPU, either directly or at the behest of the OS, can pick the appropriate segment value. If there was a linker involved in creating the app, the app may have to be re-linked to account for the new function address.
There are operating systems which can give each application its own 32-bit address space but it applies to the entire process and any child threads, not to an individual function.
As mentioned elsewhere, you really need a language where functions are first class objects, otherwise you're out of luck.
You want to copy a function? I do not think that this is possible in C generally.
Assume, you have a Harvard-Architecture microcontroller, where code (in other words "functions") is located in ROM. In this case you cannot do that at all.
Also, I know several compilers and linkers which optimize at the file level (not only per function). This results in opcode where parts of C functions are mixed into each other.
The only way I consider possible would be:
Generate the opcode of your function (e.g. by compiling/assembling it on its own).
Copy that opcode into a C array.
Use a proper function pointer, pointing to that array, to call this function.
Now you can perform all operations common to typical "data" on that array.
But apart from this: did you consider redesigning your software so that you do not need to copy a function's content?
I don't quite understand what you are trying to accomplish, but assuming you compile with -fPIC and your function doesn't do anything fancy (no calls to other functions, no access to data outside the function), you might even get away with it. I'd say the safest possibility is to limit the maximum supported function size to, say, 1 kilobyte, just transfer that, and disregard the trailing junk.
If you really need to know the exact size of a function, figure out your compiler's epilogue and prologue. They should look something like this on x86:
:your_func_epilogue
mov esp, ebp
pop ebp
ret
:end_of_func
;expect a varying length run of NOPs here
:next_func_prologue
push ebp
mov ebp, esp
Disassemble your compiler's output to check, and take the corresponding assembled sequences to search for. The epilogue alone might be enough, but all of this can bomb if the searched sequence pops up too early, e.g. in data embedded in the function. Searching for the next prologue might also get you into trouble, I think.
Now please ignore everything I wrote, since you apparently are trying to approach the problem in the wrong and inherently unsafe way. Paint us a larger picture, please: WHY are you trying to do that? Then we can see whether an entirely different approach would work.
A similar discussion was done here:
http://www.motherboardpoint.com/getting-code-size-function-c-t95049.html
They propose creating a dummy function right after the function to be copied, and then getting the memory pointers to both. But you need to switch off compiler optimizations for it to work.
If you have GCC >= 4.4, you could try switching off the optimizations for your function in particular using #pragma:
http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas
Another proposed solution was not to copy the function at all, but define the function in the place where you would want to copy it to.
Good luck!
If your linker doesn't do global optimizations, then just calculate the difference between the function pointer and the address of the next function.
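Under the stated assumptions (no reordering, no inlining, functions emitted in source order; typically only true at -O0 and without link-time optimization), the trick looks like the sketch below. Note that subtracting function addresses this way is not sanctioned by the C standard; it merely happens to work on common flat-address-space platforms.

```c
#include <stddef.h>

/* The function whose size we want, followed immediately (in source
 * order) by a marker function that we hope the toolchain places
 * directly after it in the output. */
static int func_to_measure(int x) { return x * 2; }
static void end_marker(void) { }

/* Apparent size in bytes; fragile by design, see caveats above. */
ptrdiff_t apparent_size(void)
{
    return (char *)&end_marker - (char *)&func_to_measure;
}
```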
Note that copying the function will produce something which can't be invoked if your code isn't compiled relocatable (i.e. all addresses in the code must be relative, for example branches; globals work, though since they don't move).
It sounds like you want to have a callback from your kernel driver to userspace, so that it can inform userspace when some asynchronous job has finished.
That might sound sensible, because it's the way a regular userspace library would probably do things - but for the kernel/userspace interface, it's quite wrong. Even if you manage to get your function code copied into the kernel, and even if you make it suitably position-independent, it's still wrong, because the kernel and userspace code execute in fundamentally different contexts. For just one example of the differences that might cause problems, if a page fault happens in kernel context due to a swapped-out page, that'll cause a kernel oops rather than swapping the page in.
The correct approach is for the kernel to make some file descriptor readable when the asynchronous job has finished (in your case, this file descriptor will almost certainly be the character device your driver provides). The userspace process can then wait for this event with select / poll, or with read; it can set the file descriptor non-blocking if it wants, and basically just use all the standard UNIX tools for dealing with this case. This, after all, is how the asynchronous nature of network sockets (and pretty much every other asynchronous case) is handled.
If you need to provide additional information about the event that occurred, that can be made available to the userspace process when it calls read on the readable file descriptor.
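The userspace half of that pattern is just standard fd handling. In this sketch a pipe stands in for the driver's character device so the example is self-contained:

```c
#include <poll.h>
#include <unistd.h>

/* Block until the fd becomes readable (the "job finished" event),
 * then read the event payload the driver made available. */
int wait_for_event(int fd, void *buf, size_t len)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    if (poll(&p, 1, -1) < 0)   /* -1 timeout: wait indefinitely */
        return -1;
    return (int)read(fd, buf, len);
}
```

In a real program the fd would come from open("/dev/yourdriver", O_RDONLY), and the same loop could multiplex it with other fds in one poll() call.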
A function isn't just an object you can copy. What about cross-references / symbols and so on? Of course you can take something like the standard Linux "binutils" package and torture your binaries, but is that what you want?
By the way, if you are simply trying to replace the memcpy() implementation, look into the LD_PRELOAD mechanism.
I can think of a way to accomplish what you want, but I won't tell you because it's a horrific abuse of the language.
A cleaner method than disabling optimizations and relying on the compiler to maintain the order of functions is to arrange for that function (or a group of functions that need copying) to be in its own section. This is compiler- and linker-dependent, and you'll also need to use relative addressing if you call between the functions that are copied. For those asking why you would do this: it's a common requirement in embedded systems that need to update the running code.
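On ELF targets with the GNU toolchain, the own-section approach can be sketched as follows: a section whose name is a valid C identifier automatically gets __start_/__stop_ bracketing symbols from the linker, giving the exact byte range to copy. The section and function names here are invented for the example.

```c
#include <stddef.h>

/* Place the function in its own named section. */
__attribute__((section("movable_code")))
int movable_add(int a, int b)
{
    return a + b;
}

/* GNU ld defines these for any kept section whose name is a valid
 * C identifier; they bracket the section's contents. */
extern const char __start_movable_code[];
extern const char __stop_movable_code[];

size_t movable_code_size(void)
{
    return (size_t)(__stop_movable_code - __start_movable_code);
}
```

The size is exact regardless of optimization level, though actually running a copied version still requires the code to be position-independent, as other answers point out.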
My suggestion is: don't.
Injecting code into kernel space is such an enormous security hole that most modern OSes forbid self-modifying code altogether.
As near as I can tell, the original poster wants to do something that is implementation-specific, and so not portable; this is going off what the C++ standard says on the subject of casting pointers-to-functions, rather than the C standard, but that should be good enough here.
In some environments, with some compilers, it might be possible to do what the poster seems to want to do (that is, copy a block of memory that is pointed to by the pointer-to-function to some other location, perhaps allocated with malloc, cast that block to a pointer-to-function, and call it directly). But it won't be portable, which may not be an issue. Finding the size required for that block of memory is itself dependent on the environment, and compiler, and may very well require some pretty arcane stuff (e.g., scanning the memory for a return opcode, or running the memory through a disassembler). Again, implementation-specific, and highly non-portable. And again, may not matter for the original poster.
The links to potential solutions all appear to make use of implementation-specific behaviour, and I'm not even sure that they do what they purport to do, but they may be suitable for the OP.
Having beaten this horse to death, I am curious to know why the OP wants to do this. It would be pretty fragile even if it works in the target environment (e.g., could break with changes to compiler options, compiler version, code refactoring, etc). I'm glad that I don't do work where this sort of magic is necessary (assuming that it is)...
I have done this on a Nintendo GBA, where I copied some low-level render functions from flash (16-bit access, slowish memory) to the high-speed work RAM (32-bit access, at least twice as fast). This was done by taking the address of the function immediately after the function I wanted to copy: size = (int)(NextFuncPtr - SourceFuncPtr). This worked well but obviously can't be guaranteed on all platforms (it does not work on Windows, for sure).
I think one solution can be as below.
For example, if you want to know the size of func() in program a.c, place indicators before and after the function.
Try writing a perl script which compiles this file to object format (cc -c); make sure that the pre-processor statements (your indicators) are not removed, as you need them later on to calculate the size from the object file.
Now search for your two indicators and find out the code size in between.
