Runtime detection of Linux/ARM "kuser_helper" functions

Normally on Linux/ARM, a special page mapped at 0xffff0000 is used to implement the "read TLS pointer" operation, atomic compare-and-swap, and memory barriers. This mechanism is called "kuser helpers" (CONFIG_KUSER_HELPERS) and exists to work around the lack of atomic compare-and-swap instructions on earlier ARM models. However, recent kernel versions offer an option to disable this feature on the grounds that it is a security risk (it facilitates attacks based on returning to a fixed executable address, since these functions are not subject to ASLR); this option can be used if all applications are built to use the synchronization instructions available on newer ARM models directly.
My problem is that I want to support both old ARM models (which lack synchronization instructions) and new hardened kernels (which lack kuser helpers) with the same binaries, so I'm looking for a reliable way, from userspace, to detect whether the kuser helper page is available (using it if it is, and assuming otherwise that the newer instructions must be available). By "reliable" I mean excluding things like /proc, which might not always be available. Is there any way to probe for the existence of the kuser helper page short of trying to use it and trapping SIGSEGV?

The vectors page is set up in arch/arm/kernel/traps.c:early_trap_init() during kernel initialisation and will still be present, just without the helpers, so you shouldn't get a SIGSEGV in the first place; for the same reason the mmap trick won't work (I haven't checked either of these assumptions).
But: the vectors page is initialised to zero by early_alloc_aligned(), so you're in luck, since the number of kuser helpers at 0xffff0ffc will not be filled in and is thus zero.
tl;dr: read the number of kuser helpers from 0xffff0ffc; if it is zero, they are not supported.
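
A minimal sketch of that probe in C, assuming (per the answer above) that the vectors page stays mapped even with CONFIG_KUSER_HELPERS disabled and that the word at 0xffff0ffc then reads as zero:

#include <stdint.h>

/* Sketch only: relies on the vectors page being mapped (so this read does
 * not fault) and on the kuser helper count at 0xffff0ffc being left zero
 * when the helpers are absent, as described above.  The function name is
 * chosen here purely for illustration. */
static int kuser_helpers_available(void)
{
    volatile uint32_t *nr_helpers = (volatile uint32_t *)0xffff0ffcUL;
    return *nr_helpers != 0;
}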

Related

Custom handling of memory reads and writes in C

I am working on writing my own malloc and using the LD_PRELOAD trick to use it. I need to be able to perform custom functionality for every memory access to the heap, both reads and writes (performance is not a concern, functionality is the goal).
For example, for some code like
int x = A[5];
I would like to be able to trap the read from (A + 5) and instead of reading from that memory location, return my own custom value to store in x.
The ideas I have as of now are:
mprotect away, handling the resulting SIGSEGVs and doing what I need to in the handler. As far as I know, I can access the faulting address through the si_addr field of the siginfo_t passed to the handler, but I'm not sure how to distinguish between a read and a write - and even if I did manage to do so, I'm not sure how to handle writes, since I wouldn't know the value to be written within the handler.
Tweak gcc to handle memory accesses specially. From what I have read, understanding gcc code takes a while, and unless its IR/abstract assembly conveniently isolates memory loads/stores, I'm not sure how practical this is.
Any suggestions are appreciated.
The simplest approach is via malloc (you might also want to take over mmap, munmap, mprotect, sigaction, signal, etc. for full coverage). Your malloc returns addresses which do not correspond to valid mappings; you capture SIGBUS and SIGSEGV, interpret the siginfo structure to fix up your process, and so on (see the sketch at the end of this answer). But this is somewhat limited to operating on the heap, a program can readily escape from it, and if you are trying to catch a misbehaving program, the program might corrupt your lookup tables.
For fuller coverage, you might want to take a look at gVisor, which is billed as a container runtime sandbox. Its technology is closer to that of a debugger: it takes full control over the target, capturing its faults, system calls, and so on, and manages its address space. It would likely be minor surgery to adapt it to your needs.
In either situation, when you take a fault, you have to either install the memory and restart the program or emulate the instruction. If you are dealing with a clean architecture like RISC-V or ARM, emulation isn't too bad, but for an over-indulgent one like x86 you pretty much need to integrate QEMU. If you take the gVisor-like approach, you can install the page and set the single-step flag, then remove the page on the single-step trap, which is a bit less cumbersome. There was a precursor to DTrace, called atrace, that used this approach to analyze cache and TLB access patterns.
Sounds like a fun project; I hope it goes well.
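
As for the mechanics of catching the faults mentioned above, here is a minimal sketch of the handler setup on a Linux/POSIX target; distinguishing reads from writes and actually mapping memory or emulating the instruction (the hard parts discussed in this thread) are left out:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Install SIGSEGV/SIGBUS handlers with SA_SIGINFO so the faulting address
 * arrives in info->si_addr.  A real interposer would map memory at that
 * address and return, or decode and emulate the faulting instruction;
 * here we only report the address and exit.  (fprintf is not
 * async-signal-safe; it is used only to keep the demo short.) */
static void fault_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    fprintf(stderr, "fault at %p\n", info->si_addr);
    _exit(1);
}

static void install_fault_handler(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);
}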

Threading and Thread Safety in C

When there is a common set of global data that needs to be shared among several threads, I have typically used a thread token to protect the shared resource.
Edit - 7/22/15 (to incorporate atomics as a viable option, per Jens' comments)
My [first] question is: in C, if I write my routines in such a way as to guarantee that each thread accesses one, and only one, element of an array, is there any reason to think that asynchronous and simultaneous access to different indices of the same unprotected array would be a problem?
Second question: given an object that can be accessed as an atomic entity, even in the presence of asynchronous interrupts (C99 7.14, Signal handling), would using atomics be an effective method of thread protection for an otherwise unprotected variable?
Edit (Clarifications to address questions in comments to this point):
- Specifics for this application:
- Target OS: Windows 7/8/10
- Compiler: C99 compliant (cannot use C11, which includes the _Atomic() type specifier)
- H/W: Intel i7 family
This (the N1570 draft of the C11 standard) http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf sayeth:
NOTE 1 Two threads of execution can update and access separate memory
locations without interfering with each other
NOTE 13 Compiler transformations that introduce assignments to a
potentially shared memory location that would not be modified by the
abstract machine are generally precluded by this standard, since such
an assignment might overwrite another assignment by a different thread
in cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
The way I understand it, this would rule out quamrana's concerns and guarantee that unprotected writes to separate memory locations should never result in undefined behavior when there is no data race.
In C it will depend on your platform, that is your combination of compiler, processor architecture and operating system.
Your compiler can choose how to use the internal registers and instructions of the cpu to make the executable seem to perform the intent of the program. And C may know nothing about threads. It is usually the job of the operating system to provide a threading library.
There may be processors which perform the write to an element of your array by reading a much larger patch of memory than just one element, overwriting just the bits that form that element within internal registers, and then writing the whole patch back. A single-threaded program would work just fine, but two or more threads which interrupt each other could cause chaos in the array.
On the other hand it may work out just fine.
And as has been said, read-only access is always just fine.
Also, Google is your friend. It found this Stack Overflow question.
If each thread is accessing a different array element, and only the element it is "assigned", this shouldn't be a problem. Both scenarios above are essentially equivalent, since each array element has its own address.
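
To make the first question concrete, here is a sketch of the pattern under discussion: each thread writes only to its own element, so no two threads ever touch the same memory location. POSIX threads are used purely for brevity; on the asker's Windows target the analogue would be CreateThread/WaitForSingleObject.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int results[NTHREADS];          /* one element per thread */

static void *worker(void *arg)
{
    int idx = *(int *)arg;             /* this thread's assigned index */
    results[idx] = idx * idx;          /* writes only its own element */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int idx[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        idx[i] = i;
        pthread_create(&tid[i], NULL, worker, &idx[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        printf("results[%d] = %d\n", i, results[i]);
    return 0;
}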

Erasing sensitive information from memory

After reading this question I'm curious how one would do this in C. When receiving the information from another program, we probably have to assume that the memory is writable.
I have found this stating that a regular memset may be optimized out, and this comment stating that memsets are the wrong way to do it.
The example you have provided is not quite valid: the compiler can optimize out a variable setting operation when it can detect that there are no side effects and the value is no longer used.
So, if your code uses some shared buffer, accessible from multiple locations, the memset would work fine. Almost.
Different processors use different caching policies, so you might have to use memory barriers to ensure the data (the zeros) has reached the memory chip from the cache.
So, if you are not worried about hardware-level details, making sure the compiler can't optimize the operation out is sufficient. For example, a memset of a shared block before releasing it would still be executed.
If you want to ensure the data is removed from all hardware items, you need to check how the data caching is implemented on your platform and use appropriate code to force cache flush, which can be non-trivial on multi-core machine.
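
One common way to keep the compiler from eliding the wipe (a sketch; where available, C11 Annex K's memset_s or platform helpers such as explicit_bzero serve the same purpose) is to write through a volatile-qualified pointer:

#include <stddef.h>

/* Stores through a volatile-qualified pointer count as observable side
 * effects, so the compiler cannot drop them the way it can drop a plain
 * memset of memory that is never read again.  This does not address the
 * cache/memory-barrier concerns discussed above. */
static void secure_zero(void *p, size_t n)
{
    volatile unsigned char *vp = p;
    while (n--)
        *vp++ = 0;
}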

Atomic reads in C

According to Are C++ Reads and Writes of an int Atomic?, due to issues of processor caching, reads of ints (and thus, I assume, of pointers) are not guaranteed to be atomic in C. So, my question is: is there some assembly that I could use to make the read atomic, or do I need to use a lock? I have looked at several libraries of atomic operations and, as of yet, I am unable to find a function for an atomic read.
EDIT: Compiler: Clang 2.9
EDIT: Platform: x86 (64-bit)
Thanks.
In general, a simple atomic fetch isn't provided by atomic operations libraries because it's rarely used; you read the value and then do something with it, and the lock needs to be held during that something so that you know that the value you read hasn't changed. So instead of an atomic read, there is an atomic test-and-set of some kind (e.g. gcc's __sync_fetch_and_add()) which performs the lock, then you perform normal unsynchronized reads while you hold the lock.
The exception is device drivers where you may have to actually lock the system bus to get atomicity with respect to other devices on the bus, or when implementing the locking primitives for atomic operations libraries; these are inherently machine-specific, and you'll have to delve into assembly language. On x86 processors, there are various atomic instructions, plus a lock prefix that can be applied to most operations that access memory and hold a bus lock for the duration of the operation; other platforms (SPARC, MIPS, etc.) have similar mechanisms, but often the fine details differ. You will have to know the CPU you're programming for and quite probably have to know something about the machine's bus architecture in this case. And libraries for this rarely make sense, because you can't hold bus or memory locks across function entry/exit, and even with a macro library one has to be careful because of the implication that one could intersperse normal operations between macro calls when in fact that could break locking. It's almost always better to just code the entire critical section in assembly language.
GCC has a set of atomic builtin functions, but it does not have a plain atomic fetch; however, you could do something like __sync_fetch_and_add(&<your variable here>, 0); to work around that.
The GCC docs are here, and there's the blog post linked above.
EDIT: Ah, Clang. I know LLVM IR has atomics in it, but I don't know whether Clang exposes them directly; it might be worth a shot to see whether it complains about the GCC builtins, since it may support them. EDIT: Hmm, it seems to have something of its own... the Clang docs don't cover as much as GCC's, but they do seem to suggest it also accepts the GCC builtins.
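
A small sketch of that workaround: __sync_fetch_and_add with an addend of 0 is an atomic read-modify-write that leaves the value unchanged, so its return value serves as an atomic read (Clang accepts these GCC __sync builtins too). The variable and wrapper names below are just for illustration.

#include <stdio.h>

static int shared_counter;             /* stand-in for the shared variable */

static int atomic_read(int *p)
{
    return __sync_fetch_and_add(p, 0); /* atomic fetch, value unchanged */
}

int main(void)
{
    shared_counter = 42;
    printf("%d\n", atomic_read(&shared_counter));
    return 0;
}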

Mechanism of the Boehm Weiser Garbage Collector

I was reading the paper "Garbage Collection in an Uncooperative Environment" and wondering how hard it would be to implement it. The paper describes a need to collect all addresses held in the processor's registers (in addition to the stack). The stack part seems intuitive. Is there any way to collect addresses from the registers other than enumerating each register explicitly in assembly? Let's assume x86_64 on a POSIX-like system such as Linux or Mac.
SetJmp
Since Boehm and Weiser actually implemented their GC, a basic source of information is the source code of that implementation (it is open source).
To collect the register values, you may want to subvert the setjmp() function, which saves a copy of the registers in a custom structure (at least those registers which are supposed to be preserved across function calls). But that structure is not standardized (its contents are nominally opaque), and setjmp() may be handled specially by the C compiler, making it a bit delicate to use for anything other than a longjmp() (which is already quite hard as it is). A piece of inline assembly seems much easier and safer.
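
For illustration, a sketch of the setjmp() trick just described; the caveats above apply (jmp_buf's layout is opaque and compiler-specific), and scan_stack_for_pointers() is a hypothetical stand-in for the collector's conservative mark phase:

#include <setjmp.h>

/* setjmp() spills the callee-saved registers into a jmp_buf on the current
 * stack frame, so a conservative scan from the jmp_buf up to the stack base
 * will also see any pointer values that were held only in registers. */
void scan_stack_for_pointers(void *lo, void *hi);   /* hypothetical */

void push_registers_and_scan(void *stack_base)
{
    jmp_buf regs;
    setjmp(regs);                      /* spill registers onto the stack */
    scan_stack_for_pointers(&regs, stack_base);
}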
The first hard part in the GC implementation seems to be able to reliably detect the start and end of stacks (note the plural: there may be threads, each with its own stack). This requires delving into ill-documented details of OS ABI. When my desktop system was an Alpha machine running FreeBSD, the Boehm-Weiser implementation could not run on it (although it supported Linux on the same processor).
The second hard part will be when trying to go generational, trapping write accesses by playing with page access rights. This again will require reading some documentation of questionable existence, and some inline assembly.
I think on x86_64 they use the flushrs assembly instruction to put the registers on the stack. I am sure someone on Stack Overflow will correct me if this is wrong.
It is not hard to implement a naive collector: it's just an algorithm, after all. The hard bits are as stated, but I will add the worst ones: tracking exceptions is nasty, and stopping threads is even worse; that one can't be done at all on some platforms. There's also the problem of trapping all pointers that get handed over to the OS and temporarily lost from the program (this happens a lot in Windows window message handlers).
My own multi-threaded GC is similar to the Boehm collector: more or less standard C++ with a few hacks (using jmp_buf is more or less certain to work) and a slightly less hostile environment (no exceptions). But it stops the world by cooperation, which is very bad: if you have a busy CPU, the idle ones wait for it. Boehm uses signals or other OS features to try to stop threads, but the support is very flaky.
And note also that the Intel Itanium (IA-64) processor has two stacks per thread, which is a bit hard to account for generically.
