C99: compiler optimizations when accessing global variables and aliased memory pointers - c

I'm writing C code for an embedded system. In this system, there are memory mapped registers at some fixed address in the memory map, and of course some RAM where my data segment / heap is.
I'm finding problems generating optimal code when my code is intermixing accesses to global variables in the data segment and accesses to hardware registers. This is a simplified snippet:
#include <stdint.h>
uint32_t * const restrict HWREGS = 0x20000;
struct {
uint32_t a, b;
} Context;
void example(void) {
Context.a = 123;
HWREGS[0x1234] = 5;
Context.b = Context.a;
}
This is the code generated on x86 (see also on godbolt):
example:
mov DWORD PTR Context[rip], 123
mov DWORD PTR ds:149712, 5
mov eax, DWORD PTR Context[rip]
mov DWORD PTR Context[rip+4], eax
ret
As you can see, after having written the hardware register, Context.a is reloaded from RAM before being stored into Context.b. This doesn't make sense because Context is at a different memory address than HWREGS. In other words, the memory pointed by HWREGS and the memory pointed by &Context do not alias, but it looks like there is not way to tell that to the compiler.
If I change HWREGS definition as this:
extern uint32_t * const restrict HWREGS;
that is, I hide the fixed memory address to the compiler, I get this:
example:
mov rax, QWORD PTR HWREGS[rip]
mov DWORD PTR [rax+18640], 5
movabs rax, 528280977531
mov QWORD PTR Context[rip], rax
ret
Context:
.zero 8
Now the two writes to Context are optimized (even coalesced to a single write), but on the other hand the access to the hardware register does not happen anymore with a direct memory access but it goes through a pointer indirection.
Is there a way to obtain optimal code here? I would like GCC to know that HWREGS is at a fixed memory address and at the same time to tell it that it does not alias Context.

If you want to avoid compilers reloading regularly values from a memory region (possibly due to aliasing), then the best is not to use global variables, or at least not to use direct accesses to global variables. The register keyword seems ignored for global variables (especially here on HWREGS) for both GCC and Clang. Using the restrict keyword on function parameters solves this problem:
#include <stdint.h>
uint32_t * const HWREGS = 0x20000;
struct Context {
uint32_t a, b;
} context;
static inline void exampleWithLocals(uint32_t* restrict localRegs, struct Context* restrict localContext) {
localContext->a = 123;
localRegs[0x1234] = 5;
localContext->b = localContext->a;
}
void example() {
exampleWithLocals(HWREGS, &context);
}
Here is the result (see also on godbolt):
example:
movabs rax, 528280977531
mov DWORD PTR ds:149712, 5
mov QWORD PTR context[rip], rax
ret
context:
.zero 8
Please note that the strict aliasing rule do not help in this case since the type of read/written variables/fields is always uint32_t.
Besides this, based on its name, the variable HWREGS looks like a hardware register. Please note that it should be put volatile so that compiler do not keep it to registers nor perform any similar optimization (like assuming the pointed value is left unchanged if the code do not change it).

Related

How are oversized struct returned on the stack?

It is said that returning an oversized struct by value (as opposed to returning a pointer to the struct) from a function incurs unnecessary copy on the stack. By "oversized", I mean a struct that cannot fit in the return registers.
However, to quote Wikipedia
When an oversized struct return is needed, another pointer to a caller-provided space is prepended as the first argument, shifting all other arguments to the right by one place.
and
When returning struct/class, the calling code allocates space and passes a pointer to this space via a hidden parameter on the stack. The called function writes the return value to this address.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
If the function inlines, the copying through the return-value object can be fully optimized away. Otherwise, maybe not, and arg copying definitely can't be.
It appears that at least on x86 architectures, the struct in question is directly written by the callee to the memory appointed by the caller, so why would there be a copy then? Does returning oversized structs really incur copy on the stack?
It depends what the caller does with the return value,; if it's assigned to a provably private object (escape analysis), that object can be the return-value object, passed as the hidden pointer.
But if the caller actually wants to assign the return value to other memory, then it does need a temporary.
struct large retval = some_func(); // no extra copying at all
*p = some_func() // caller will make space for a local return-value object & copy.
(Unless the compiler knows that p is just pointing to a local struct large tmp;, and escape analysis can prove that there's no way some global variable could have a pointer to that same tmp var.)
long version, same thing with more details:
In the C abstract machine, there's a "return value object", and return foo copies the named variable foo to that object, even if it's a large struct. Or return (struct lg){1,2}; copies an anonymous struct. The return-value object itself is anonymous; nothing can take its address. (You can't int *p = &foo(123);). This makes it easier to optimize away.
In the caller, that anonymous return-value object can be assigned to whatever you want, which would be another copy if compilers didn't optimize anything. (All of this applies for any type, even int). Of course, compilers that aren't total garbage will avoid some, ideally all, of that copying, when doing so can't possibly change the observable results. And that depends on the design of the calling convention. As you say, most conventions, including all the mainstream x86 and x86-64 conventions, pass a "hidden pointer" arg for return values they choose not to return in register(s) for whatever reason (size, C++ having a non-trivial constructor).
struct large retval = foo(...);
For such calling conventions, the above code is effectively transformed to
struct large retval;
foo(&retval, ...);
So it's C return-value object actually is a local in the stack-frame of its caller. foo() is allowed to store into that return-value object whenever it wants during execution, including before reading some other objects. This allows optimization within the callee (foo) as well, so a struct large tmp = ... / return tmp can be optimized away to just store into the return-value object.
So there's zero extra copying when the caller does just want to assign the function return value to a newly declared local var. (Or to a local var which it can prove is still private, via escape analysis. i.e. not pointed-to by any global vars).
But what if the caller wants to store the return value somewhere else?
void caller2(struct large *lgp) {
*lgp = foo();
}
Can *lgp be the return-value object, or do we need to introduce a local temporary?
void caller2(struct large *lgp) {
// foo_asm(lgp); // nope, possibly unsafe
struct large retval; foo(&retval); *lgp = retval; // safe
}
If you want functions to be able to write large structs to arbitrary locations, you have to "sign off" on it by making that effect visible in your source.
What prevents the usage of a function argument as hidden pointer? for more details about why *lgp can't be the return-value object / hidden pointer, and another example. "A function is allowed to assume its return-value object (pointed-to by a hidden pointer) is not the same object as anything else". Also details of whether struct large *restrict lgp would make it safe: probably yes if the function doesn't longjmp (otherwise stores to the supposedly anonymous retval object might end up as visible side effects without return having been reached), but GCC doesn't look for that optimization.
Why is tailcall optimization not performed for types of class MEMORY? - return bar() where bar returns the same struct should be possible as an optimized tailcall, causing extra copying. This can even introduce extra copying of the whole struct, as well as failing to optimize call bar / ret into jmp bar.
how c compiler treats a struct return value from a function, in ASM - thresholds for returning in registers. e.g. i386 System V always returns structs in memory, even struct {int x;};.
Is it possible within a function to get the memory address of the variable initialized by the return value?
C/C++ returning struct by value under the hood an actual example (but unfortunately using debug-mode compiler-generated asm, so it contains copying that isn't necessary).
How do objects work in x86 at the assembly level? example at the bottom of how x86-64 System V packs the bytes of a struct into RDX:RAX, or just RAX if less than 8 bytes.
An example showing early stores to the return-value object (instead of copying)
(all source + asm on the Godbolt compiler explorer)
// more or less extra size will get compilers to copy it around with SSE2 or not
struct large { int first, second; char pad[0];};
int *global_ptr;
extern int a;
NOINLINE // __attribute__((noinline))
struct large foo() {
struct large tmp = {1,2};
if (a)
tmp.second = *global_ptr;
return tmp;
}
(targeting GNU/Linux) clang -m32 -O3 -mregparm=1 creates an implementation that writes its return-value object before it's done reading everything else, exactly the case that would make it unsafe for the caller to pass a pointer to some globally-reachable memory.
The asm makes it clear that tmp is fully optimized away, or is the retval object.
# clang -O3 -m32 -mregparm=1
foo:
mov dword ptr [eax + 4], 2
mov dword ptr [eax], 1 # store tmp into the retval object
cmp dword ptr [a], 0
je .LBB0_2 # if (a == 0) goto ret
mov ecx, dword ptr [global_ptr] # load the global
mov ecx, dword ptr [ecx] # deref it
mov dword ptr [eax + 4], ecx # and store to the retval object
.LBB0_2:
ret
(-mregparm=1 means pass the first arg in EAX, less noisy and easier to quickly visually distinguish from stack space than passing on the stack. Fun fact: i386 Linux compiles the kernel with -mregparm=3. But fun fact #2: if a hidden pointer is passed on the stack (i.e. no regparm), that arg is callee pops, unlike the rest. The function will use ret 4 to do ESP+=4 after popping the return address into EIP.)
In a simple caller, the compiler just reserves some stack space, passes a pointer to it, and then can load member variables from that space.
int caller() {
struct large lg = {4, 5}; // initializer is dead, foo can't read its retval object
lg = foo();
return lg.second;
}
caller:
sub esp, 12
mov eax, esp
call foo
mov eax, dword ptr [esp + 4]
add esp, 12
ret
But with a less trivial caller:
int caller() {
struct large lg = {4, 5};
global_ptr = &lg.first;
// unknown(&lg); // or this: as a side effect, might set global_ptr = &tmp->first;
lg = foo(); // (except by inlining) the compiler can't know if foo() looks at global_ptr
return lg.second;
}
caller:
sub esp, 28 # reserve space for 2 structs, and alignment
mov dword ptr [esp + 12], 5
mov dword ptr [esp + 8], 4 # materialize lg
lea eax, [esp + 8]
mov dword ptr [global_ptr], eax # point global_ptr at it
lea eax, [esp + 16] # hidden first arg *not* pointing to lg
call foo
mov eax, dword ptr [esp + 20] # reload from the retval object
add esp, 28
ret
Extra copying with *lgp = foo();
int caller2(struct large *lgp) {
global_ptr = &lgp->first;
*lgp = foo();
return lgp->second;
}
# with GCC11.1 this time, SSE2 8-byte copying unlike clang
caller2: # incoming arg: struct large *lgp in EAX
push ebx #
mov ebx, eax # lgp, tmp89 # lgp needed after foo returns
sub esp, 24 # reserve space for a retval object (and waste 16 bytes)
mov DWORD PTR global_ptr, eax # global_ptr, lgp
lea eax, [esp+8] # hidden pointer to the retval object
call foo #
movq xmm0, QWORD PTR [esp+8] # 8-byte copy of both halves
movq QWORD PTR [ebx], xmm0 # *lgp_2(D), tmp86
mov eax, DWORD PTR [ebx+4] # lgp_2(D)->second, lgp_2(D)->second # reload int return value
add esp, 24
pop ebx
ret
The copy to *lgp needs to happen, but it's somewhat of a missed optimization to reload from there, instead of from [esp+12]. (Saves a byte of code size at the cost of more latency.)
Clang does the copy with two 4-byte integer register mov loads/stores, but one of them is into EAX so it already has the return value ready.
You might also want to look at the result of assigning to memory freshly allocated with malloc. Compilers know that nothing else can (legally) be pointing to the newly allocated memory: that would be use-after-free undefined behaviour. So they may allow passing on a pointer from malloc as the return-value object if it hasn't been passed to anything else yet.
Related fun fact: passing large structs by value always requires a copy (if the function doesn't inline). But as discussed in comments, the details depend on the calling convention. Windows differs from i386 / x86-64 System V calling conventions (all non-Windows OSes) on this:
SysV calling conventions copy the whole struct to the stack. (if they're too large to fit in a pair of registers for x86-64)
Windows x64 makes a copy and passes (like a normal arg) a pointer to that copy. The callee "owns" the arg and can modify it, so a tmp copy is still needed. (And no, const struct large foo has no effect.)
https://godbolt.org/z/ThMrE9rqT shows x86-64 GCC targeting Linux vs. x64 MSVC targeting Windows.
This really depends on your compiler, but in general the way this works is that the caller allocates the memory for the struct return value, but the callee also allocates stack space for any intermediate value of that structure. This intermediate allocation is used when the function is running, and then the struct is copied onto the caller's memory when the function returns.
For reference as to why your solution won't always work, consider a program which has two of the same struct and returns one based on some condition:
large_t returntype(int condition) {
large_t var1 = {5};
large_t var2 = {6};
// More intermediate code here
if(condition) return var1;
else return var2;
}
In this case, both may be required by the intermediate code, but the return value is not known at compile time, so the compiler doesn't know which to initialize on the caller's stack space. It's easier to just keep it local and copy on return.
EDIT: Your solution may be the case in simple functions, but it really depends on the optimizations performed by each individual compiler. If you're really interested in this, check out https://godbolt.org/

Most efficient way to initialise variables in C

I am working on an embedded application in which RAM availability is very limited. In analysing the memory consumption I see that the bss segment is rather large and researching more it sounds like this can be due to lack of variable initialisation.
This is an example of the sort of thing I would normally do. Lets say I have this struct:
typedef struct
{
float time;
float value;
} pair_t;
I normally don't initialise variables on declaration, so instead do something like this:
pair_t test()
{
pair_t ret;
ret.time = 0;
ret.value = 0;
return ret;
}
Would I be better off doing it all in one, like this? Or does it make no difference?
pair_t test()
{
pair_t ret = (pair_t)
{
.time = 0,
.value = 0
};
return ret;
}
In analysing the memory consumption I see that the bss segment is rather large and researching more it sounds like this can be due to lack of variable initialisation.
A large bss segment simply means that your application has a lot of global and/or static variables (or some very large ones). These variables are in the bss segment because they are initialized to zero, or are left uninitialized -- but initializing them to nonzero values will just make them move to the data segment, which also resides in memory. If the memory usage of these variables is an issue, you need to use less of them. Changing the way they're initialized won't help.
That all being said, the variable you're dealing with here isn't in the bss segment at all, because it's a local variable. And there is absolutely no difference between initializing the variable when it's declared and initializing it explicitly with assignment statements; any sensible compiler will generate the exact same code for both.
TL;DR
if you care only about zeroing the structure, the least to write is = {0} in initializing the structure, and most probably this will also result in the best code, except in cases where you do not want to initialize all members; to not initialize all members you must use the strategy 1, i.e. assignments to members.
A good implementation can notice that each variant has the same side effects and generate identical code for each option if optimizations are enabled. A bad implementation might not, just as it is very much easier for us humans to see that 0 is zero than it is to see that
is zero too.
Notice that
pair_t test()
{
pair_t ret = (pair_t)
{
.time = 0,
.value = 0
};
return ret;
}
additionally creates a compound literal of type pair_t which is unused. But a bad compiler might keep it around. The proper initialization code is
pair_t ret = {
.time = 0,
.value = 0
};
without that cast-looking thing.
And it might even be cheaper to use the simple zero initializer
pair_t ret = { 0 };
which should have the same effect but is even more explicitly zero.
At least MSVC seems to be tricked without optimizations enabled, c.f. this one.
But when optimizations enabled (-O3), GCC 9.1 x86-64 will generate
test:
pxor xmm0, xmm0
ret
for all options, but again this is a quality-of-implementation issue, for a compiler of inferior quality (MSVC 19.14) might generate shortest code for only strictly default-initializer {0}:
ret$ = 8
test PROC ; COMDAT
xor eax, eax
mov QWORD PTR ret$[rsp], rax
ret 0
test ENDP
If you compare this with using the strictly correct {0.0f, 0.0f}:
test PROC ; COMDAT
xorps xmm1, xmm1
xorps xmm0, xmm0
unpcklps xmm0, xmm1
movq rax, xmm0
ret 0
test ENDP

How to save the context execution of a newly created user thread, Linux 64 to structure in C?

I am trying to implement a new user thread management library similar to the original pthread but only in C. Only the context switch should be assembler.
Looks like I am missing something fundamentally.
I have the following structure for the context execution:
enter code here
struct exec_ctx {
uint64_t rbp;
uint64_t r15;
uint64_t r14;
uint64_t r13;
uint64_t r12;
uint64_t r11;
uint64_t r10;
uint64_t r9;
uint64_t r8;
uint64_t rsi;
uint64_t rdi;
uint64_t rdx;
uint64_t rcx;
uint64_t rbx;
uint64_t rip;
}__attribute__((packed));
I create new thread structure and I should put the registers into the mentioned variables, part of the context execution structure. How may I do it on C? Everywhere only talks about setcontext, getcontext, but this is not the case here.
Also, the only hint I received is I need to have some kind of dump stack function into the create function.... not sure how to do it. Please advise where can I read further/how to do it.
Thanks in advance!
I started with:
char *stack;
stack = malloc(StackSize);
if (!stack)
return -1;
*(uint64_t *)&stack[StackSize - 8] = (uint64_t)stop;
*(uint64_t *)&stack[StackSize - 16] = (uint64_t)f;
pet_thread->ctx.rip = (uint64_t)&stack[StackSize - 16];
pet_thread->thread_state = Ready;
This is how I put a pointer to the thread function on the top of the stack in order to call the thread more easily.
First of all, you do not need to save all the registers. Since your context switch is implemented as a function, any register that the ABI defines as "caller saved" or "clobbered" you can safely leave out. The code generated by the C compiler will assume it might change.
Since this is a school assignment I will not give you the code to do this. I will give you the outline.
Your function needs to both save the registers to the struct for the outgoing micro-thread and load the register for the incoming micro-thread. The reason is that you have logically always have one register set "in effect". So your function needs two arguments, the struct for the outgoing micro-thread and the one for the incoming.
Those two arguments are stored in two registers. Those two you do not need to save. So your code should have the following structure (assuming your structure, which, as I said, is too complete):
# save context
mov [rdi], rbp
add 8, rdi
...
#load context
mov rbp, [rsi]
add 8, rsi
...
If you place that in a separate .S file, you'll make sure that the C compiler will not add anything or optimize anything.
This is not the cleanest or most efficient solution, but it is the simplest.

Generating functions at runtime in C

I would like to generate a function at runtime in C. And by this I mean I would essentially like to allocate some memory, point at it and execute it via function pointer. I realize this is a very complex topic and my question is naïve. I also realize there are some very robust libraries out there that do this (e.g. nanojit).
But I would like to learn the technique, starting with the basics. Could someone knowledgeable give me a very simple example in C?
EDIT: The answer below is great but here is the same example for Windows:
#include <Windows.h>
#define MEMSIZE 100*1024*1024
typedef void (*func_t)(void);
int main() {
HANDLE proc = GetCurrentProcess();
LPVOID p = VirtualAlloc(
NULL,
MEMSIZE,
MEM_RESERVE|MEM_COMMIT,
PAGE_EXECUTE_READWRITE);
func_t func = (func_t)p;
PDWORD code = (PDWORD)p;
code[0] = 0xC3; // ret
if(FlushInstructionCache(
proc,
NULL,
0))
{
func();
}
CloseHandle(proc);
VirtualFree(p, 0, MEM_RELEASE);
return 0;
}
As said previously by other posters, you'll need to know your platform pretty well.
Ignoring the issue of casting a object pointer to a function pointer being, technically, UB, here's an example that works for x86/x64 OS X (and possibly Linux too). All the generated code does is return to the caller.
#include <unistd.h>
#include <sys/mman.h>
typedef void (*func_t)(void);
int main() {
/*
* Get a RWX bit of memory.
* We can't just use malloc because the memory it returns might not
* be executable.
*/
unsigned char *code = mmap(NULL, getpagesize(),
PROT_READ|PROT_EXEC|PROT_WRITE,
MAP_SHARED|MAP_ANON, 0, 0);
/* Technically undefined behaviour */
func_t func = (func_t) code;
code[0] = 0xC3; /* x86 'ret' instruction */
func();
return 0;
}
Obviously, this will be different across different platforms but it outlines the basics needed: get executable section of memory, write instructions, execute instructions.
This requires you to know your platform. For instance, what is the C calling convention on your platform? Where are parameters stored? What register holds the return value? What registers must be saved and restored? Once you know that, you can essentially write some C code that assembles code into a block of memory, then cast that memory into a function pointer (though this is technically forbidden in ANSI C, and will not work depending if your platform marks some pages of memory as non-executable aka NX bit).
The simple way to go about this is simply to write some code, compile it, then disassemble it and look at what bytes correspond to which instructions. You can write some C code that fills allocated memory with that collection of bytes and then casts it to a function pointer of the appropriate type and executes.
It's probably best to start by reading the calling conventions for your architecture and compiler. Then learn to write assembly that can be called from C (i.e., follows the calling convention).
If you have tools, they can help you get some things right easier. For example, instead of trying to design the right function prologue/epilogue, I can just code this in C:
int foo(void* Data)
{
return (Data != 0);
}
Then (MicrosoftC under Windows) feed it to "cl /Fa /c foo.c". Then I can look at "foo.asm":
_Data$ = 8
; Line 2
push ebp
mov ebp, esp
; Line 3
xor eax, eax
cmp DWORD PTR _Data$[ebp], 0
setne al
; Line 4
pop ebp
ret 0
I could also use "dumpbin /all foo.obj" to see that the exact bytes of the function were:
00000000: 55 8B EC 33 C0 83 7D 08 00 0F 95 C0 5D C3
Just saves me some time getting the bytes exactly right...

Multithreading with inline assembly and access to a c variable

I'm using inline assembly to construct a set of passwords, which I will use to brute force against a given hash. I used this website as a reference for the construction of the passwords.
This is working flawlessly in a singlethreaded environment. It produces an infinite amount of incrementing passwords.
As I have only basic knowledge of asm, I understand the idea. The gcc uses ATT, so I compile with -masm=intel
During the attempt to multithread the program, I realize that this approach might not work.
The following code uses 2 global C variables, and I assume that this might be the problem.
__asm__("pushad\n\t"
"mov edi, offset plaintext\n\t" <---- global variable
"mov ebx, offset charsetTable\n\t" <---- again
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
It produces a non deterministic result in the plaintext variable.
How can i create a workaround, that every thread accesses his own plaintext variable? (If this is the problem...).
I tried modifying this code, to use extended assembly, but I failed every time. Probably due to the fact that all tutorials use ATT syntax.
I would really appreciate any help, as I'm stuck for several hours now :(
Edit: Running the program with 2 threads, and printing the content of plaintext right after the asm instruction, produces:
b
b
d
d
f
f
...
Edit2:
pthread_create(&thread[i], NULL, crack, (void *) &args[i]))
[...]
void *crack(void *arg) {
struct threadArgs *param = arg;
struct crypt_data crypt; // storage for reentrant version of crypt(3)
char *tmpHash = NULL;
size_t len = strlen(param->methodAndSalt);
size_t cipherlen = strlen(param->cipher);
crypt.initialized = 0;
for(int i = 0; i <= LIMIT; i++) {
// intel syntax
__asm__ ("pushad\n\t"
//mov edi, offset %0\n\t"
"mov edi, offset plaintext\n\t"
"mov ebx, offset charsetTable\n\t"
"L1: movzx eax, byte ptr [edi]\n\t"
" movzx eax, byte ptr [charsetTable+eax]\n\t"
" cmp al, 0\n\t"
" je L2\n\t"
" mov [edi],al\n\t"
" jmp L3\n\t"
"L2: xlat\n\t"
" mov [edi],al\n\t"
" inc edi\n\t"
" jmp L1\n\t"
"L3: popad\n\t");
tmpHash = crypt_r(plaintext, param->methodAndSalt, &crypt);
if(0 == memcmp(tmpHash+len, param->cipher, cipherlen)) {
printf("success: %s\n", plaintext);
break;
}
}
return 0;
}
Since you're already using pthreads, another option is making the variables that are modified by several threads into per-thread variables (threadspecific data). See pthread_getspecific OpenGroup manpage. The way this works is like:
In the main thread (before you create other threads), do:
static pthread_key_y tsd_key;
(void)pthread_key_create(&tsd_key); /* unlikely to fail; handle if you want */
and then within each thread, where you use the plaintext / charsetTable variables (or more such), do:
struct { char *plainText, char *charsetTable } *str =
pthread_getspecific(tsd_key);
if (str == NULL) {
str = malloc(2 * sizeof(char *));
str.plainText = malloc(size_of_plaintext);
str.charsetTable = malloc(size_of_charsetTable);
initialize(str.plainText); /* put the data for this thread in */
initialize(str.charsetTable); /* ditto */
pthread_setspecific(tsd_key, str);
}
char *plaintext = str.plainText;
char *charsetTable = str.charsetTable;
Or create / use several keys, one per such variable; in that case, you don't get the str container / double indirection / additional malloc.
Intel assembly syntax with gcc inline asm is, hm, not great; in particular, specifying input/output operands is not easy. I think to get that to use the pthread_getspecific mechanism, you'd change your code to do:
__asm__("pushad\n\t"
"push tsd_key\n\t" <---- threadspecific data key (arg to call)
"call pthread_getspecific\n\t" <---- gets "str" as per above
"add esp, 4\n\t" <---- get rid of the func argument
"mov edi, [eax]\n\t" <---- first ptr == "plainText"
"mov ebx, [eax + 4]\n\t" <---- 2nd ptr == "charsetTable"
...
That way, it becomes lock-free, at the expense of using more memory (one plaintext / charsetTable per thread), and the expense of an additional function call (to pthread_getspecific()). Also, if you do the above, make sure you free() each thread's specific data via pthread_atexit(), or else you'll leak.
If your function is fast to execute, then a lock is a much simpler solution because you don't need all the setup / cleanup overhead of threadspecific data; if the function is either slow or very frequently called, the lock would become a bottleneck though - in that case the memory / access overhead for TSD is justified. Your mileage may vary.
Protect this function with mutex outside of inline Assembly block.

Resources