Is there a crypto backend for cryptsetup that is either always thread-safe, or can easily be used (or even modified, preferably with minimal effort) in a thread-safe manner, simply for testing whether a key is correct?
Background and what I have tried:
I started by testing whether I could modify the cryptsetup source to simply test multiple keys using pthreads. This crashed; I believe I used gcrypt initially. Eventually I tried all of the backends available in the stable cryptsetup source and found that openssl and nettle seem to avoid crashing.
However, my testing was not very thorough, and even though nettle specifically does not crash, it does not seem to work correctly when using threads. With a single thread it always works, but increasing the number of threads makes it increasingly likely that it will silently fail to find the correct key.
This is for brute-forcing LUKS devices. I am aware the PBKDF slows it down to a crawl. I'm also aware that the AES key space cannot be exhausted even if the KDF were not there. This is just for the fun of building it in a network-distributed and multithreaded manner.
I noticed in the source of cryptsetup (libdevmapper.c):
/*
* libdevmapper is not context friendly, switch context on every DM call.
* FIXME: this is not safe if called in parallel but neither is DM lib.
*/
However, it is possible I'm simply not using it correctly.
if (!LUKS_open_key_with_hdr(CRYPT_ANY_SLOT, key, strlen(key),
                            &cd->u.luks1.hdr, &vk, cd)) {
    return 0;
}
Each worker thread does this. I call crypt_init() and crypt_load() only once before the worker threads start up, and pass each thread its own separate copy of the struct crypt_device. vk is created locally for each attempt. The keys are simply fetched from a wordlist, with access controlled by a mutex. I found that if each thread calls these functions (crypt_init and crypt_load) on every attempt, it seems to crash more easily.
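For concreteness, here is a minimal sketch of what I mean by a worker; next_word() and report_found() are hypothetical helpers (next_word() serializes the wordlist fetch behind the mutex), and the success check mirrors the snippet above:

static void *worker(void *arg)
{
    struct crypt_device *cd = arg;   /* this thread's private copy */
    struct volume_key *vk;
    char key[512];

    while (next_word(key, sizeof(key))) {
        vk = NULL;
        if (!LUKS_open_key_with_hdr(CRYPT_ANY_SLOT, key, strlen(key),
                                    &cd->u.luks1.hdr, &vk, cd)) {
            report_found(key);       /* key verified against the header */
            return NULL;
        }
    }
    return NULL;
}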
Is it completely incorrect to try to start removing and rewriting the code that uses dm-crypt? In LUKS_endec_template() it attaches a loop device to the crypto device and creates a DM device, which it eventually gives to open(); the resulting fd is then passed to read_blockwise(). My idea was to simply skip all of that, since I don't really need to use the device except to verify the key. However, simply opening the crypto device directly (and giving it to read_blockwise()) does not work.
Related
I'm in the midst of wrapping a C library with cgo to be usable by normal Go code.
My problem is that I'd like to propagate error strings up to the Go API, but the C library in question makes error strings available via thread-local storage; there's a global get_error() call that returns a pointer to thread local character data.
My original plan was to call into C via cgo, check if the call returned an error, and if so, wrap the error string using C.GoString to convert it from a raw character pointer into a Go string. It'd look something like C.GoString(C.get_error()).
The problem that I foresee here is that TLS in C works on the level of native OS threads, but in my understanding, the calling Go code will be coming from one of potentially N goroutines that are multiplexed across some number of underlying native threads in a thread pool managed by the Go scheduler.
What I'm afraid of is running into a situation where I call into the C routine, then after the C routine returns, but before I copy the error string, the Go scheduler decides to swap the current goroutine out for another one. When the original goroutine gets swapped back in, it could potentially be on a different native thread for all I know, but even if it gets swapped back onto the same thread, any goroutines that ran there in the intervening time could've changed the state of the TLS, causing me to load an error string for an unrelated call.
My questions are these:
Is this a reasonable concern? Am I misunderstanding something about the Go scheduler, or the way it interacts with cgo, that would make this a non-issue?
If this is a reasonable concern, how can I work around it?
cgo somehow manages to propagate errno values back to the calling Go code, which are also stored in TLS, which makes me think there must be a safe way to do this.
I can't think of a way that the C code itself could get preempted by the Go scheduler, so should I introduce a wrapper C function and have IT make the necessary call, then conditionally copy the error string before returning back up to Go land?
I'm interested in any solution that would allow me to propagate the error strings out to the rest of Go, but I'm hoping to avoid any solution that would require me to serialize accesses around the TLS, as adding a lock just to grab an error string seems greatly unfortunate to me.
Thanks in advance!
What I'm afraid of is running into a situation where I call into the C routine, then after the C routine returns, but before I copy the error string, the Go scheduler decides to swap the current goroutine out for another one. ...
Is this a reasonable concern?
Yes. The cgo "call C code" wrappers lock on to one POSIX / OS thread for the duration of each call, but the thread they lock is not fixed for all time; it does in fact bop around, as it were, to multiple different threads over time, as long as your goroutines are operating normally. (Since Go is cooperatively scheduled in the current implementations, you can, in some circumstances, be careful not to do anything that might let you switch underlying OS threads, but this is probably not a good plan.)
You can use runtime.LockOSThread here, but I think the best plan is otherwise:
how can I work around it?
Grab the error before Go resumes its normal scheduling algorithm (i.e., before unlocking the goroutine from the C / POSIX thread).
cgo somehow manages to propagate errno values ...
It grabs the errno value before unlocking the goroutine from the POSIX thread.
My original plan was to call into C via cgo, check if the call returned an error, and if so, wrap the error string using C.GoString to convert it from a raw character pointer into a Go string. It'd look something like C.GoString(C.get_error()).
If there is a variant of this that takes the error number (rather than fishing it out of a TLS variable), that plan should still work: just make sure that your C routines provide both the return value and the error number.
If not, write your own C wrapper, just as you suggested:
ftype wrapper_for_realfunc(char **errp, arg1type arg1, arg2type arg2) {
    ftype ret = realfunc(arg1, arg2);
    if (IS_ERROR(ret)) {
        /* copy the TLS pointer while we are still on the calling thread */
        *errp = get_error();
    } else {
        *errp = NULL;
    }
    return ret;
}
Now your Go wrapper simply calls this wrapper with an extra *C.char argument; the wrapper fills in that pointer to C memory, setting it to nil if there is no error, and to something on which you can use C.GoString if there is an error.
If that's not feasible for some reason, consider using runtime.LockOSThread and its counterpart, runtime.UnlockOSThread.
Task
I have a small kernel module I wrote for my Raspberry Pi 2 which implements an additional system call for generating power consumption metrics. I would like to modify the system call so that its body only runs if a special user (such as "root" or user "pi") issues it. Otherwise, the call just skips the bulk of its body and returns success.
Background Work
I've read into the issue at length, and I've found a similar question on SO, but there are numerous problems with it, from my perspective (noted below).
Question
The linked question notes that struct task_struct contains a pointer element to struct cred, as defined in linux/sched.h and linux/cred.h. The latter of the two headers doesn't exist on my system(s), and the former doesn't show any declaration of a pointer to a struct cred element. Does this make sense?
Silly mistake. This is present in its entirety in the kernel headers (i.e., /usr/src/linux-headers-$(uname -r)/include/linux/cred.h); I was searching the gcc-build headers in /usr/include/linux.
Even if the above worked, it doesn't mention whether I would be getting the real, effective, or saved UID for the process. Is it even possible to get each of these three values from within the system call?
cred.h already contains all of these.
Is there a safe way in the kernel module to quickly determine which groups the user belongs to without parsing /etc/group?
cred.h already contains all of these.
Update
So, the only valid question remaining is the following:
Note, that iterating through processes and reading process's
credentials should be done under RCU-critical section.
... how do I ensure my check is run in this critical section? Are there any working examples of how to accomplish this? I've found some existing kernel documentation that instructs readers to wrap the relevant code with rcu_read_lock() and rcu_read_unlock(). Do I just need to wrap any read operations against the struct cred and/or struct task_struct data structures?
First, adding a new system call is rarely the right way to do things. It's best to do things via the existing mechanisms because you'll benefit from already-existing tools on both sides: existing utility functions in the kernel, existing libc and high-level language support in userland. Files are a central concept in Linux (like other Unix systems) and most data is exchanged via files, either device files or special filesystems such as proc and sysfs.
I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it.
You can't do this in the kernel. Not only is it wrong from a design point of view, but it isn't even possible. The kernel knows nothing about user names. The only knowledge about users in the kernel is that some privileged actions are reserved to user 0 in the root namespace (don't forget that last part! And if that's new to you it's a sign that you shouldn't be doing advanced things like adding system calls). (Many actions actually look for a capability rather than being root.)
What you want to use is sysfs. Read the kernel documentation and look for non-ancient online tutorials or existing kernel code (code that uses sysfs is typically pretty clean nowadays). With sysfs, you expose information through files under /sys. Access control is up to userland — have a sane default in the kernel and do things like calling chgrp, chmod or setfacl in the boot scripts. That's one of the many wheels that you don't need to reinvent on the user side when using the existing mechanisms.
The sysfs show method automatically takes a lock around the file, so only one kernel thread can be executing it at a time. That's one of the many wheels that you don't need to reinvent on the kernel side when using the existing mechanisms.
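As a rough illustration (not your exact metric code: read_power_metric() and the attribute/directory names are made up, and module cleanup is omitted), a minimal read-only sysfs attribute looks something like this:

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

/* Hypothetical helper that returns the current metric. */
extern int read_power_metric(void);

static ssize_t power_show(struct kobject *kobj,
                          struct kobj_attribute *attr, char *buf)
{
    return sprintf(buf, "%d\n", read_power_metric());
}

static struct kobj_attribute power_attr = __ATTR_RO(power);

static int __init power_init(void)
{
    /* Exposes /sys/kernel/power_metric/power; access control is then
     * a plain chmod/chgrp/setfacl from userland. */
    struct kobject *kobj = kobject_create_and_add("power_metric", kernel_kobj);
    if (!kobj)
        return -ENOMEM;
    return sysfs_create_file(kobj, &power_attr.attr);
}
module_init(power_init);
MODULE_LICENSE("GPL");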
The linked question concerns a fundamentally different issue. To quote:
Please note that the uid that I want to get is NOT of the current process.
Clearly, a thread which is not the currently executing thread can in principle exit at any point or change credentials. Measures need to be taken to ensure the stability of whatever we are fiddling with. RCU is often the right answer. The answer provided there is somewhat wrong in the sense that there are other ways as well.
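For that non-current case, the pattern from the kernel documentation quoted above is a rcu_read_lock()/rcu_read_unlock() bracket; a sketch, assuming you look the task up by PID:

#include <linux/cred.h>
#include <linux/pid.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/uidgid.h>

/* Sketch: read another task's real UID under an RCU read-side section. */
static kuid_t uid_of_pid(pid_t nr)
{
    struct task_struct *task;
    kuid_t uid = INVALID_UID;

    rcu_read_lock();
    task = pid_task(find_vpid(nr), PIDTYPE_PID);
    if (task)
        uid = __task_cred(task)->uid;   /* valid only inside the RCU section */
    rcu_read_unlock();
    return uid;
}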
Meanwhile, if you want to operate on the thread executing the very code, you know it won't exit (because it is executing your code, as opposed to an exit path). A question arises about the stability of credentials -- good news: they are also guaranteed to be there and can be accessed with no preparation whatsoever. This can easily be verified by checking the code that does credential switching.
We are left with the question of which primitives can be used to do the access. To that end, one can use make_kuid, uid_eq, and similar primitives.
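For the current thread, that boils down to something like this sketch (UID 1000 for user "pi" is an assumption; check /etc/passwd on your Pi):

#include <linux/cred.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Sketch: is the calling task root, or the hypothetical "pi" user (UID 1000)
 * in the initial namespace? */
static bool caller_may_use_syscall(void)
{
    kuid_t pi = make_kuid(&init_user_ns, 1000);

    return uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
           (uid_valid(pi) && uid_eq(current_uid(), pi));
}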
The real question is why is this a syscall as opposed to just a /proc file.
See this blogpost for somewhat elaborated description of credential handling: http://codingtragedy.blogspot.com/2015/04/weird-stuff-thread-credentials-in-linux.html
I am writing a tool. Part of that tool will be its ability to log the parameters of system calls. I can use ptrace for that purpose, but ptrace is pretty slow. A faster method that came to mind was to modify glibc. But this is getting difficult, as gcc magically inserts its own built-in functions as system call wrappers rather than using the code defined in glibc. Using -fno-builtin does not help there either.
So I came up with the idea of writing a shared library which includes every system call wrapper, such as mmap, and performs the logging before calling the actual system call wrapper function. For example, here is roughly what my mmap would look like:
#define _GNU_SOURCE            /* for RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>
#include <sys/mman.h>

void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t off)
{
    fprintf(stderr, "mmap(%p, %zu, %d, %d, %d, %ld)\n",
            addr, len, prot, flags, fd, (long)off);
    void *(*real)(void *, size_t, int, int, int, off_t) = dlsym(RTLD_NEXT, "mmap");
    return real(addr, len, prot, flags, fd, off);
}
Then I can use LD_PRELOAD to load this library first. Do you think this idea will work, or am I missing something?
No method that you can possibly dream up in user space will work seamlessly with any application. Fortunately for you, there is already support for doing exactly what you want to do in the kernel. Kprobes and Kretprobes allow you to examine the state of the machine just preceding and following a system call.
Documentation here: https://www.kernel.org/doc/Documentation/kprobes.txt
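For flavor, here is a minimal kprobe module that logs each hit on an example symbol (do_sys_open is just an illustrative target; probe whichever entry point you care about):

#include <linux/kprobes.h>
#include <linux/module.h>

/* Log every hit on the probed symbol via a pre-handler. */
static int pre_handler(struct kprobe *p, struct pt_regs *regs)
{
    pr_info("%s hit\n", p->symbol_name);
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "do_sys_open",   /* example target, adjust to taste */
    .pre_handler = pre_handler,
};

static int __init probe_init(void)
{
    return register_kprobe(&kp);
}

static void __exit probe_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");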
As others have mentioned, if the binary is statically linked, the dynamic linker is never involved, so any attempt to intercept functions using libdl will simply not take effect. Instead, you should consider launching the process yourself and detouring the entry point of the function you wish to intercept.
This means launching the process yourself, intercepting its execution, and rewriting its memory to place a jump instruction at the beginning of the function's definition in memory, jumping to a new function that you control.
If you want to intercept the actual system calls and can't use ptrace, you will either have to find the execution site for each system call and rewrite it, or you may need to overwrite the system call table in memory and filter out everything except the process you want to control.
All system calls from user space go through an interrupt handler to switch to kernel mode; if you find this handler you can probably add something there.
EDIT: I found this: http://cateee.net/lkddb/web-lkddb/AUDITSYSCALL.html. Linux kernels 2.6.6–2.6.39 and 3.0–3.4 have support for system call auditing. It is a kernel option that has to be enabled. Maybe you can look at the source for this feature if it's not too confusing.
If the code you are developing is process-related, sometimes you can develop alternative implementations without breaking the existing code. This is helpful if you are rewriting an important system call and would like a fully functional system with which to debug it.
For your case: you are rewriting the mmap() algorithm to take advantage of an exciting new feature (or enhancing it with a new feature). Unless you get everything right on the first try, it would not be easy to debug the system: a nonfunctioning mmap() system call is certain to result in a nonfunctioning system. As always, there is hope.
Often, it is safe to keep the existing algorithm in place and construct your replacement on the side. You can achieve this by using the user ID (UID) as a conditional to decide which algorithm to use:
/* Note: on modern kernels the uid lives in struct cred, so this check
 * would be written as !uid_eq(current_uid(), KUIDT_INIT(7777)). */
if (current->uid != 7777) {
        /* old algorithm ... */
} else {
        /* new algorithm ... */
}
All users except UID 7777 will use the old algorithm. You can create a special user, with UID 7777, for testing the new algorithm. This makes it much easier to test critical process-related code.
What is the best way to permit C code to regularly access the instantaneous value of an integer generated by a separate LabVIEW program?
I have time-critical C code that controls a scientific experiment and records data once every 20 ms. I also have some LabVIEW code that operates a different instrument and outputs an integer value every 100 ms. I want my C code to be able to record the value from LabVIEW. What is the best way to do this?
One idea is to have LabVIEW write the integer to a file in a loop, and have the C code read the value from the file in a loop. (I could add a second thread to my C code if necessary.) LabVIEW can also link to C DLLs, so I might be able to write a DLL in C that somehow facilitates sharing between the two programs. Is that advisable? How would I do that?
I have a similar application here and use TCP sockets with the TCP_NODELAY option set (this disables the Nagle algorithm, which does a form of packet buffering). Sockets should allow a 100 ms update rate without problems, although the actual network delay will always remain an unknown variable. For my application this does not matter as long as it stays under a certain limit (this is also checked by sending a timestamp with each packet, and big red dialog boxes if the timestamp delta becomes too large :]). Does it matter for your application? I.e., is it important that whenever the LV instrument acquires a new sample, its value makes it to the C app within x ms?
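Setting that option on the C side is a single setsockopt() call on the connected socket (POSIX shown; on Windows the value argument must be cast to const char *):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm on an already-connected TCP socket. */
static void disable_nagle(int fd)
{
    int one = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        perror("setsockopt(TCP_NODELAY)");
}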
You might get the DLL approach working, but it's not as straightforward as sockets and it will make the two applications more dependent on each other. Variable access will be pretty much instantaneous, though. I see at least two possibilities:
put your entire C app in a DLL (might seem a weird approach at first, but it works), and have LV load it and call methods on it. E.g., to start your app, LV calls the DLL's Start() method; then, in the loop where LV acquires its samples, it calls the DLL's NewSampleValue() method or similar. This also means your app cannot run standalone unless you write a separate host process for it.
look into shared process memory, and have the C app and another DLL share common memory. LV will load that DLL and call a method on it to write a value to the shared memory; the C app can then read it after polling a flag (which needs a lock!).
it might also be possible to have the C app call the LV program using dll/activeX/? calls, but I don't know how that system works.
I would definitely stay away from the file approach: disk I/O can be a real bottleneck, and it also has the locking problem, which is messy to solve with files. The C app cannot read the file while LV is writing it and vice versa, which might introduce extra delays.
On a side note, you can see that each of the approaches above uses either a push or a pull model (the TCP one can be implemented both ways); this might affect your final decision on which way to go. Push = LV signals the C app directly; pull = the C app has to poll a flag or ask LV for the value.
I'm an employee at National Instruments, and I wanted to make sure you didn't miss the Network Variable API that is provided with LabWindows/CVI, the National Instruments C development environment. The Network Variable API will allow you to easily communicate with the LabVIEW program over Shared Variables (http://zone.ni.com/devzone/cda/tut/p/id/4679). While reading these links, note that a Network Variable and a Shared Variable are the same thing - the different names are unfortunate...
The nice thing about the Network Variable API is that it allows easy interoperability with LabVIEW, it provides a strongly typed communication mechanism, and it provides a callback model for notification when the Network/Shared variable's properties (such as value) change.
You can obtain this API by installing LabWindows/CVI, but it is not necessary to use the LabWindows/CVI environment. The header file is available at C:\Program Files\National Instruments\CVI2010\include\cvinetv.h, and the .lib file located at C:\Program Files\National Instruments\CVI2010\extlib\msvc\cvinetv.lib can be linked in with whatever C development tools you are using.
I followed up on one of @stijn's ideas:
have the C app and another DLL share common memory. LV will load that DLL and call a method on it to write a value to the shared memory; the C app can then read it after polling a flag (which needs a lock!).
I wrote the InterProcess library, available here: http://github.com/samuellab/InterProcess
InterProcess is a compact general library that sets up Windows shared memory using CreateFileMapping() and MapViewOfFile(). It allows the user to seamlessly store values of any type (int, char, your struct... whatever) in an arbitrary number of named fields. It also implements Mutex objects to avoid collisions and race conditions, and it abstracts all of this away behind a clean and simple interface. Tested on Windows XP; it should work with any modern Windows.
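For the curious, the underlying Win32 pattern that InterProcess wraps looks roughly like this standalone sketch (the mapping and mutex names are made up; both processes must use the same ones):

#include <stdio.h>
#include <windows.h>

/* Open (or create) a named shared LONG plus a named mutex guarding it. */
int main(void)
{
    HANDLE map = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                    0, sizeof(LONG), "Local\\LvSharedInt");
    HANDLE lock = CreateMutexA(NULL, FALSE, "Local\\LvSharedIntLock");
    LONG *value = MapViewOfFile(map, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(LONG));
    if (!map || !lock || !value)
        return 1;

    WaitForSingleObject(lock, INFINITE);    /* take the mutex before reading */
    printf("current value: %ld\n", *value);
    ReleaseMutex(lock);

    UnmapViewOfFile(value);
    CloseHandle(lock);
    CloseHandle(map);
    return 0;
}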
For interfacing between my existing C code and LabVIEW, I wrote a small wrapper DLL that sits on top of InterProcess and exposes only the specific functions that my C code or LabVIEW need to access. In this way, all of the shared memory is completely abstracted away.
Hopefully someone else will find this code useful.
How can I do dataflow programming (pipes and filters, stream processing, flow-based programming) in C? And not with UNIX pipes.
I recently came across stream.py.
Streams are iterables with a pipelining mechanism to enable data-flow programming and easy parallelization.
The idea is to take the output of a function that turns an iterable into another iterable and plug that as the input of another such function. While you can already do this using function composition, this package provides an elegant notation for it by overloading the >> operator.
I would like to duplicate a simple version of this kind of functionality in C. I particularly like the overloading of the >> operator to avoid function composition mess. Wikipedia points to this hint from a Usenet post in 1990.
Why C? Because I would like to be able to do this on microcontrollers and in C extensions for other high level languages (Max, Pd*, Python).
* (ironic given that Max and Pd were written, in C, specifically for this purpose – I'm looking for something barebones)
I know it's not a good answer, but you should build your own simple dataflow framework.
I've written a prototype DF server (together with a friend of mine) which still has several unimplemented features: it can only pass Integer and Trigger data in messages, and it does not support parallelism. I simply skipped that work: each component's producer ports have a list of function pointers to consumer ports, set up during initialization, and they call them (if the list is not empty). So, when an event fires, the components perform a tree-like walk-through of the dataflow graph. As they work only with Integers and Triggers, it's extremely quick.
Also, I've written a strange component which has one consumer and one producer port; it simply passes the data through - but in another thread. Its consumer routine finishes quickly, as it just stores the data and sets a flag for the producer-side thread. Dirty, but it suits my needs: it detaches long-running processing from the tree walk.
So, as you may have recognized, it's a low-traffic asynchronous system for quick tasks, where the graph size does not matter.
Unfortunately, your problem differs from mine in many ways, just as one dataflow system can differ from another: you need a synchronous, parallel, stream-handling solution.
I think the biggest issue in a DF server is the dispatcher. Concurrency, collisions, threads, priority... as I said, I just skipped the problem rather than solving it. You should skip it, too. And you should skip other problems as well.
Dispatcher
In the case of a synchronous DF architecture, all components must run once per cycle, except in special cases. They have a simple precondition: is the input data available? So you just scan through the components and pass each one whose data is available to a free caller thread. After processing all of them, you will have N remaining components which haven't been processed. Process the list again; after the second pass you will have M remaining. If N == M, the cycle is over.
I think this kind of approach will work as long as the number of components stays below about 100.
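As a rough sketch of one such cycle (component_ready(), fire_component(), and the comps[]/fired[] arrays are made-up names):

/* One synchronous cycle: sweep the component list until a full pass
 * fires nothing new (the N == M condition above). */
int before, remaining = n_components;          /* fired[] starts all-zero */
do {
    before = remaining;
    remaining = 0;
    for (int i = 0; i < n_components; i++) {
        if (fired[i])
            continue;                          /* already ran this cycle */
        if (component_ready(comps[i])) {       /* is its input data available? */
            fire_component(comps[i]);          /* or hand to a free caller thread */
            fired[i] = 1;
        } else {
            remaining++;
        }
    }
} while (remaining != before);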
Binding
Yep, the best way of binding is visual programming. Until the editor is finished, config-like code should be used instead, something like:
// disclaimer: not actual code
Component* c1 = new AddComponent();
Component* c2 = new PrintComponent();
c2->format = "The result is %d\n";
bind(c1->result, c2->feed);
It's easy to write and well readable - what more could you wish for?
Message
You should pass pure raw packets between components' ports. You only need a list of bindings, where each binding holds a pair of pointers to a producer port and a consumer port, plus the processed flag that the "dispatcher" uses.
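Concretely, the binding list could be as small as this (all names are made up):

/* One producer port wired to one consumer port, plus the flag the
 * dispatcher checks. */
typedef struct Component Component;

typedef struct Port {
    Component *owner;    /* component this port belongs to */
    int        port_id;  /* which input/output of that component */
} Port;

typedef struct Binding {
    Port *producer;
    Port *consumer;
    int   processed;     /* set once the message has been consumed */
} Binding;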
Calling issue
The problem is that the producer should not call the consumer port but the component: all component (class) variables and firings live in the component. So the producer should either call the component's common entry point directly, passing the consumer's ID to it, or call the port, which then calls a method of the component to which it belongs.
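With the Port struct from the sketch above, the second option (call the port, which calls the owning component) is a one-liner; component_fire() is a made-up name for the component's common entry point:

/* Deliver a value to a consumer port by calling the owning component's
 * entry point with the port's ID. */
static void deliver(Port *consumer, int value)
{
    component_fire(consumer->owner, consumer->port_id, value);
}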
So, if you can live with some restrictions, I say go ahead and write your lite framework. It's a good task, and writing small components and seeing how smartly they can be wired together to build a great app is the ultimate fun.
If you have further questions, feel free to ask, I often scan the "dataflow" keyword here.
Possibly, you can figure out a simpler dataflow-ish model for your program.
I'm not aware of any library for such a purpose. A friend of mine implemented something similar at university as a lab assignment. The main problems of such systems are low performance (really bad if the functions in long pipelines are smallish) and the potential need to implement scheduling (detecting deadlocks and boosting priority to avoid overflowing pipe buffers).
From my experience with similar data processing, error handling is quite burdensome. Since functions in the pipeline know little of the context (intentionally, for reusability), they can't produce sensible error messages. One can implement in-line error handling - passing errors down the pipe as data - but that would require special handling all over the place, especially on the output side, as with streams it is not possible to correlate an error with the input that caused it.
Considering the known performance problems of the approach, it is hard for me to imagine how it would fit microcontrollers. Performance-wise, nothing beats a plain function: one can create a function for every path through the data pipeline.
Probably you can look at some Petri net implementation (simulator or code generator), as Petri nets are one of the theoretical bases for streams.
This is cool: http://code.google.com/p/libconcurrency/
A lightweight concurrency library for C, featuring symmetric coroutines as the main control flow abstraction. The library is similar to State Threads, but using coroutines instead of green threads. This simplifies inter-procedural calls and largely eliminates the need for mutexes and semaphores for signaling.
Eventually, coroutine calls will also be able to safely migrate between kernel threads, so the achievable scalability is consequently much higher than State Threads, which is purposely single-threaded.
This library was inspired by Douglas W. Jones' "minimal user-level thread package". The pseudo-platform-neutral probing algorithm on the svn trunk is derived from his code.
There is also a safer, more portable coroutine implementation based on stack copying, which was inspired by sigfpe's page on portable continuations in C. Copying is more portable and flexible than stack switching, and making copying competitive with switching is being researched.