What does each entry in the Jmp_buf structure hold? - c

I am running Ubuntu 9.10 (Karmic Koala), and I took a look at the jmp_buf structure which is simply an array of 12 ints. When I use setjmp, and pass in a jmp_buf structure—4 out of 12 entries are saved off. These 4 entries are the stack pointer, frame pointer, program counter and return address. What are the other 8 entries for? Are they machine-dependent? Is another entry the segment table base register? What else is needed to properly restore a thread/process's environment? I looked through the man page, other sources, but I couldn't find the assembly code for setjmp.

On MacOS X 10.6.2, the header <setjmp.h> ends up using <i386/setjmp.h>, and in there it says:
#if defined(__x86_64__)
/*
* _JBLEN is number of ints required to save the following:
* rflags, rip, rbp, rsp, rbx, r12, r13, r14, r15... these are 8 bytes each
* mxcsr, fp control word, sigmask... these are 4 bytes each
* add 16 ints for future expansion needs...
*/
#define _JBLEN ((9 * 2) + 3 + 16)
typedef int jmp_buf[_JBLEN];
typedef int sigjmp_buf[_JBLEN + 1];
#else
/*
* _JBLEN is number of ints required to save the following:
* eax, ebx, ecx, edx, edi, esi, ebp, esp, ss, eflags, eip,
* cs, de, es, fs, gs == 16 ints
* onstack, mask = 2 ints
*/
#define _JBLEN (18)
typedef int jmp_buf[_JBLEN];
typedef int sigjmp_buf[_JBLEN + 1];
#endif
You would probably find similar requirements on Linux - the jmp_buf contains enough information to store the necessary state. And, to use it, you really don't need to know what it contains; all you need to do is trust that the implementers got it correct. If you want to alter the implementation, then you do need to understand it, of course.
Note that setjmp and longjmp are very machine specific. Read Plauger's "The Standard C Library" for a discussion of some of the issues involved in implementing them. More modern chips make it harder to implement really well.

setjmp/longjmp/sigsetjmp are highly dependent on the CPU architecture, operating system, and threading model. The first two functions famously (or infamously—depending on your POV) appeared in the original Unix kernel as a "structured" way to unwind out of a failed system call, as from an i/o error or other nasty situations.
The structure's comments in /usr/include/setjmp.h (Linux Fedora) say Calling environment, plus possibly a saved signal mask. It includes /usr/include/bits/setjmp.h to declare jmp_buf to have an array of six 32-bit ints, apparently specific to the x86 family.
While I couldn't find source other than a PPC implementation, the comments there reasonably hint that FPU settings should be saved. That makes sense since failing to restore the rounding mode, default operand size, exception handling, etc. would be surprising.
It's typical of system engineers to reserve a little more space than actually needed in such a structure. A few extra bytes are hardly anything to sweat—especially considering the rarity of actual uses of setjmp/longjmp. Having too little space definitely is a hazard. The most salient reason I can think of is having extra—as opposed to being spot on—is that if the runtime library version is changed to need more space in jmp_buf, by having extra room already reserved, there's no need to recompile programs referring to it.

Related

Using memcmp with DOS far pointers

I have an old program I wrote in 1995. It is written with Borland C and DOS 6.22. It uses a far model with data in different segments. The program uses EMS memory and that is why the pointers need to be far. I need to use memcmp( a, b, c) but I get an error "Warning panel.c 325: Suspicious pointer conversion in function enterPanel" and I suspect that is because I have a far pointer. Is there a far version of memcpy that I should be using? (I searched for such a function but couldn't find it). You may be wondering why I don't just code a loop but I want to use the intrinsic capability to get the most speed.
Here is a snip from my code:
ASM function
------------
align 4
public _mainMem
_mainMem label dword
mainMem dd ? ;pointer to main memory (if no EMS)
mov ax,emsSegment ;segment of EMS page frame
mov word ptr mainMem+2,ax
mov word ptr mainMem,0
C program
---------
extern unsigned short _far *mainMem;
short watchArray[80];
unsigned int watchAddress, watchLen;
memcmp((_far*)&mainMem[watchAddress], watchArray, watchLen)
Also I tried removing the (_far*).

How avoid cache line invalidation from multiple threads writing to a shared array?

Context of the problem:
I am writing a code that creates 32 threads, and set affinity of them to each one of the 32 cores in my multi-core-multi-processor system.
Threads simply execute the RDTSCP instruction and the value is stored in a shared array at a non-overlapping position, this is the shared array:
uint64_t rdtscp_values[32];
So, every thread is going to write to the specific array position based on its core number.
Up to know, everything is working properly with the exception that I know that I may not be using the right data structure to avoid cache line bouncing.
P.S: I have checked already that my processor's cache line is 64-bytes wide.
Because I am using a simple uint64_t array, it implies that a single cache line is going to store 8 positions of this array, because of the read-ahead.
Question:
Because of this simple array, although the threads write to different indexes, my understanding tells that every write to this array will cause a cache invalidation to all other threads?
How could I create a structure that is aligned to the cache line?
EDIT 1
My system is: 2x Intel Xeon E5-2670 2.30GHz (8 cores, 16 threads)
Yes you definitely want to avoid "false sharing" and cache-line ping-pong.
But this probably doesn't make sense: if these memory locations are thread-private more often than they're collected by other threads, they should be stored with other per-thread data so you're not wasting cache footprint on 56 bytes of padding. See also Cache-friendly way to collect results from multiple threads. (There's no great answer; avoid designing a system that needs really fine-grained gathering of results if you can.)
But let's just assume for a minute that unused padding between slots for different threads is actually what you want.
Yes, you need the stride to be 64 bytes (1 cache line), but you don't actually need the 8B you're using to be at the start of each cache line. Thus, you don't need any extra alignment as long as the uint64_t objects are naturally-aligned (so they aren't split across a cache-line boundary).
It's fine if each thread is writing to the 3rd qword of its cache line instead of the 1st. OTOH, aligning to 64B makes sure nothing else is sharing a cache line with the first element, and it's easy so we might as well.
Static storage: aligning static storage is very easy in ISO C11 using alignas(), or with compiler-specific stuff.
With a struct, padding is implicit to make the size a multiple of the required alignment. Having one member with an alignment requirement implies that the whole struct requires at least that much alignment. The compiler takes care of this for you with static and automatic storage, but you have to use aligned_alloc or an alternative for over-aligned dynamic allocation.
#include <stdalign.h> // for #define alignas _Alignas for C++ compat
#include <stdint.h> // for uint64_t
// compiler knows the padding is just padding
struct { alignas(64) uint64_t v; } rdtscp_values[32];
int foo(unsigned t) {
rdtscp_values[t].v = 1;
return sizeof(rdtscp_values[0]); // yes, this is 64
}
Or with an array as suggested by # Eric Postpischil:
alignas(64) // optional, stride will still be 64B without this.
uint64_t rdtscp_values_2d[32][8]; // 8 uint64_t per cache line
void bar(unsigned t) {
rdtscp_values_2d[t][0] = 1;
}
alignas() is optional if you don't care about the whole thing being 64B aligned, just having 64B stride between elements you use. You could also use __attribute__((aligned(64))) in GNU C or C++, or __declspec(align(64)) for MSVC, using #ifdef to define an ALIGN macro that's portable across the major x86 compilers.
Either way produces the same asm. We can check compiler output to verify that we got what we wanted. I put it up on the Godbolt compiler explorer. We get:
foo: # and same for bar
mov eax, edi # zero extend 32-bit to 64-bit
shl rax, 6 # *64 is the same as <<6
mov qword ptr [rax + rdtscp_values], 1 # store 1
mov eax, 64 # return value = 64 = sizeof(struct)
ret
Both arrays are declared the same way, with the compiler requesting 64B alignment from the assembler/linker with the 3rd arg to .comm:
.comm rdtscp_values_2d,2048,64
.comm rdtscp_values,2048,64
Dynamic storage:
If the number of threads is not a compile-time constant, then you can use an aligned allocation function to get aligned dynamically-allocated memory (especially if you want to support a very high number of threads). See How to solve the 32-byte-alignment issue for AVX load/store operations?, but really just use C11 aligned_alloc. It's perfect for this, and returns a pointer that's compatible with free().
struct { alignas(64) uint64_t v; } *dynamic_rdtscp_values;
void init(unsigned nthreads) {
size_t sz = sizeof(dynamic_rdtscp_values[0]);
dynamic_rdtscp_values = aligned_alloc(nthreads*sz, sz);
}
void baz(unsigned t) {
dynamic_rdtscp_values[t].v = 1;
}
baz:
mov rax, qword ptr [rip + dynamic_rdtscp_values]
mov ecx, edi # same code as before to scale by 64 bytes
shl rcx, 6
mov qword ptr [rax + rcx], 1
ret
The address of the array is no longer a link-time constant, so there's an extra level of indirection to access it. But the pointer is read-only after it's initialized, so it will stay shared in cache in each core and reloading it when needed is very cheap.
Footnote: In the i386 System V ABI, uint64_t only has 4B-alignment inside structs by default (without alignas(8) or __attribute__((aligned(8)))), so if you put an int before a uint64_t and didn't do any alignment of the whole struct, it would be possible to get cache-line splits. But compilers align it by 8B whenever possible, so your struct-with padding is still fine.
SOLUTION
So, I followed the comments here and I must say thanks for all contributions.
Finally I got what I expected: cache lines being used properly per each thread.
Here is the shared structure:
typedef struct align_st {
uint64_t v;
uint64_t padding[7];
} align_st_t __attribute__ ((aligned (64)));
I am using a padding uint64_t padding[7] inside the structure to fill the remaining bytes in the cache line when this structure is loaded to the L1 cache. Nonetheless, I am asking to the compiler to use 64-bytes memory alignment when compiling it __attribute__ ((aligned (64))).
So, I allocate this structure dynamically based on the number of cores, using the memalign() for this:
align_st_t *al = (align_st_t*) memalign(64, n_cores * sizeof(align_st_t));
To compare it, I wrote one code version (V1) that uses these aligned mechanisms, and other code version (V2) that uses the simple array method.
By executing with perf, and I got these numbers:
V1: 7.184 cache-misses;
V2: 2.621.347 cache-misses.
P.S.: Each thread is writing 1-thousand times to the same address of the shared structure just to increase the numbers

Kernel Dev: Setting ES:DI in real mode

I'm working on a toy kernel for fun and education (not a class project). I'm starting work on my memory manager, so I'm trying to get the memory map from BIOS using an INT 0x15, EAX=E820 call while still in Real Mode. I'm adapting my function from the osdev wiki (here, in the section "Getting an E820 Memory Map"). However, I want this to be a function I can call from my C code, so I'm trying to change it a bit. I want it to take two arguments: a pointer to where to store the map entries, and a pointer to an integer which will be incremented by the number of entries in the table.
According to the wiki, ES:DI needs to be pointing at where the data should be stored, so I split my first argument into two (the segment selector, pointer_to_map / 16, and the offset, pointer_to_map % 16). Here's part of C code:
typedef struct SMAP_entry {
unsigned int baseL; // Base address, a QWORD
unsigned int baseH;
unsigned int lengthL; // Length, a QWORD
unsigned int lengthH;
unsigned int type; // entry type
unsigned int ACPI; // extra data from ACPI 3.0
} SMAP_entry_t;
SMAP_entry_t data[100];
kprint("Pointer: ");
kprint_int((int) data, 16);
kprint_newline();
int res = 0;
read_mem_map(((int) data) / 16, ((int) data) % 16, &res);
kprint("res: ");
kprint_int(res, 16);
kprint_newline();
Here's part of my ASM code:
; performs a INT 0x15, eax=0xE820 call to find the memory map
; inputs: the pointer to the data table / 16, the pointer % 16, a pointer to an dword (int) which will be
; incremented by the number of entries after this function returns.
; preserves: no registers except esi
read_mem_map:
mov es, [esp + 4] ; set es to the value of the first argument
mov di, [esp + 8] ; set di to the value of the second argument
That's all I'm pasting in because the program triple-faults and shuts down the VM there. By moving ret commands around, I found that the function crashes on the very first line. If I comment out the call in C, then everything works as you'd expect.
I've read through Google that there's almost never a reason to set ES:DI directly, and in the code that I've found which does, they set it to a literal. How should I set ES:DI and if I shouldn't set it directly, how should I make the C and ASM interact in the correct way?
Each of the segment registers (on 80x86) have a visible part, and several hidden fields (the segment base, the segment limit and the segment's attributes - read/write, privilege level, etc).
In protected mode; when you load a segment register the CPU uses the visible part as an index into either the GDT or LDT, and loads the segment's hidden fields from that descriptor (in the GDT or LDT).
In real mode; the CPU does something completely different - it only sets the segment base to "visible part * 16" and doesn't use any (GDT, LDT) table.
Given the fact that you're using a 32-bit pointer to the data table and a 32-bit stack pointer (e.g. mov es, [esp + 4]); I assume your C code is in 32-bit protected mode. This is completely incompatible with real mode, partly because segment loads work completely differently and partly because the default operand/address size is 32-bit and not 16-bit.
All BIOS functions are designed for real mode. They can't be used in protected mode.
Basically; I'd recommend:
pass the pointer to the data table to your assembly as a 32-bit integer/pointer (and not 2 separate 16-bit integers)
call a "go to real mode" function (which will be slightly tricky, as you'd also be switching from a 32-bit stack to a 16-bit stack and will need a "32-bit return instruction" in 16-bit code).
split the pointer to the data table into its segment and offset in assembly, and load the segment (which should work correctly as you're in real mode now)
call the BIOS function (which should work correctly as you're in real mode now)
call a "go to protected mode" function (which will be slightly tricky again, including a "16-bit return instruction" in 32-bit code).
return to the (32-bit protected mode) caller
Instructions for switching from real mode to protected mode, and switching from protected mode to real mode, are included in Intel's system programmer's guide. :)

In ARM Linux, what is the purpose of the few bytes reserved at the "bottom" of kernel stack for each thread

Question:
Why are 8 bytes reserved at the "bottom" of kernel stack when it is created?
Background:
We know that struct pt_regs and thread_info share the same 2 consecutive pages(8192 bytes), with pt_reg located at the higher end and thread_info at the lower end.
However, I noticed that 8 bytes are reserved at the highest address of these 2 pages:
in arch/arm/include/asm/threadinfo.h
#define THREAD_START_SP (THREAD_SIZE - 8)
This way you can access to thread_info structure just by reading stack pointer and masking out THREAD_SIZE bits (otherwise SP initially would be on the next THREAD_SIZE block).
static inline struct thread_info *current_thread_info(void)
{
register unsigned long sp asm ("sp");
return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
Eight bytes come from the ARM calling convention that SP needs to be 8-byte aligned.
Update:
AAPCS 5.2.1.1 states:
A process may only access (for reading or writing) the closed interval of the entire stack delimited by [SP, stack-base – 1] (where SP is the value of register r13).
Since stack is full-descending
THREAD_START_SP (THREAD_SIZE - 8)
would enforce this requirement probably by illegal access to next page (segmentation fault).
Why are 8 bytes reserved at the "bottom" of kernel stack when it is created?
If we reserve anything on the stack, it must be a multiple of eight.
If we peek above the stack, we like to make sure it is mapped.
Multiple of eight
The stack and user register needs to be aligned to 8 bytes. This just makes things more efficient as many ARMs have a 64bit bus and operations on the kernel stack (such as ldrd and strd) may have these requirements. You can see the protection in usr_entry macro. Specifically,
#if defined(CONFIG_AEABI) && (__LINUX_ARM_ARCH__ >= 5) && (S_FRAME_SIZE & 7)
#error "sizeof(struct pt_regs) must be a multiple of 8"
#endif
ARMv5 (architecture version 5) adds the ldrd and strd instructions. It is also a requirement of the EABI version of the kernel (versus OABI). So if we reserve anything on the stack, it must be a multiple of 8.
Peeking on stack
For the very top frame, we may want to take a peek at previous data. In order not to constantly check that the stack is in the 8K range an extra entry is reserved. Specifically, I think that signals need to peek at the stack.

C - How to create a pattern in code segment to recognize it in memory dump?

I dump my RAM (a piece of it - code segment only) in order to find where is which C function being placed. I have no map file and I don't know what boot/init routines exactly do.
I load my program into RAM, then if I dump the RAM, it is very hard to find exactly where is what function. I'd like to use different patterns build in the C source, to recognize them in the memory dump.
I've tryed to start every function with different first variable containing name of function, like:
char this_function_name[]="main";
but it doesn't work, because this string will be placed in the data segment.
I have simple 16-bit RISC CPU and an experimental proprietary compiler (no GCC or any well-known). The system has 16Mb of RAM, shared with other applications (bootloader, downloader). It is almost impossible to find say a unique sequence of N NOPs or smth. like 0xABCD. I would like to find all functions in RAM, so I need unique identificators of functions visible in RAM-dump.
What would be the best pattern for code segment?
If it were me, I'd use the symbol table, e.g. "nm a.out | grep main". Get the real address of any function you want.
If you really have no symbol table, make your own.
struct tab {
void *addr;
char name[100]; // For ease of searching, use an array.
} symtab[] = {
{ (void*)main, "main" },
{ (void*)otherfunc, "otherfunc" },
};
Search for the name, and the address will immediately preceed it. Goto address. ;-)
If your compiler has inline asm you can use it to create a pattern. Write some NOP instructions which you can easily recognize by opcodes in memory dump:
MOV r0,r0
MOV r0,r0
MOV r0,r0
MOV r0,r0
How about a completely different approach to your real problem, which is finding a particular block of code: Use diff.
Compile the code once with the function in question included, and once with it commented out. Produce RAM dumps of both. Then, diff the two dumps to see what's changed -- and that will be the new code block. (You may have to do some sort of processing of the dumps to remove memory addresses in order to get a clean diff, but the order of instructions ought to be the same in either case.)
Numeric constants are placed in the code segment, encoded in the function's instructions. So you could try to use magic numbers like 0xDEADBEEF and so on.
I.e. here's the disassembly view of a simple C function with Visual C++:
void foo(void)
{
00411380 push ebp
00411381 mov ebp,esp
00411383 sub esp,0CCh
00411389 push ebx
0041138A push esi
0041138B push edi
0041138C lea edi,[ebp-0CCh]
00411392 mov ecx,33h
00411397 mov eax,0CCCCCCCCh
0041139C rep stos dword ptr es:[edi]
unsigned id = 0xDEADBEEF;
0041139E mov dword ptr [id],0DEADBEEFh
You can see the 0xDEADBEEF making it into the function's source. Note that what you actually see in the executable depends on the endianness of the CPU (tx. Richard).
This is a x86 example. But RISC CPUs (MIPS, etc) have instructions moving immediates into registers - these immediates can have special recognizable values as well (although only 16-bit for MIPS, IIRC).
Psihodelia - it's getting harder and harder to catch your intention. Is it just a single function you want to find? Then can't you just place 5 NOPs one after another and look for them? Do you control the compiler/assembler/linker/loader? What tools are at your disposal?
As you noted, this:
char this_function_name[]="main";
... will end up setting a pointer in your stack to a data segment containing the string. However, this:
char this_function_name[]= { 'm', 'a', 'i', 'n' };
... will likely put all these bytes in your stack so you will be able to recognize the string in your code (I just tried it on my platform).
Hope this helps
Why not get each function to dump its own address. Something like this:
void* fnaddr( char* fname, void* addr )
{
printf( "%s\t0x%p\n", fname, addr ) ;
return addr ;
}
void test( void )
{
static void* fnaddr_dummy = fnaddr( __FUNCTION__, test ) ;
}
int main (int argc, const char * argv[])
{
static void* fnaddr_dummy = fnaddr( __FUNCTION__, main ) ;
test() ;
test() ;
}
By making fnaddr_dummy static, the dump is done once per-function. Obviously you would need to adapt fnaddr() to support whatever output or logging means you have on your system. Unfortunately, if the system performs lazy initialisation, you'll only get the addresses of the functions that are actually called (which may be good enough).
You could start each function with a call to the same dummy function like:
void identifyFunction( unsigned int identifier)
{
}
Each of your functions would call the identifyFunction-function with a different parameter (1, 2, 3, ...). This will not give you a magic mapfile, but when you inspect the code dump you should be able to quickly find out where the identifyFunction is because there will be lots of jumps to that address. Next scan for those jump and check before the jump to see what parameter is passed. Then you can make your own mapfile. With some scripting this should be fairly automatic.

Resources