Fibers / Coroutines vs Delimited Continuations - theory

So I read a paper about concurrent work stealing deques here: http://open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3872.pdf. They mention "Child-Stealing vs Continuation Stealing" and they say that child stealing can require unbounded stack space to hold yet-to-be executed tasks unlike continuation stealing which is a constant factor of P=#processors.
I have a theoretical questions on the difference between fibers / coroutines and delimited continuations. First, I accept that coroutines and fibers are pretty much equivalent, but are fibers also equivalent to continuations? I have a sneaking suspicion that what I'm about to implement is fundamentally wrong (ie. replacing threads with fibers, and not actually implementing a version that doesn't require unbounded memory).

At the conceptual level, "stackful coroutines" is equivalent with "one-shot delimited continuation" or "one-short continuation" in terms of their expressivity, and fibers means stackful coroutines with a (often plugable) scheduler, see.

Related

Nussinov Parallel

I am a biologist and I am trying to study computer languages. But, when I was trying to learn about the lpthread library, it seems odd as the result was lower than the sequential version.
In fact I am still reading the Tanenbaum book. But my main focus is to learn the basics of the calculations of the secondary structure of RNAs. So I found the explanation to the nussinov algorithm in a book and did indeed implement it. But when I tried to make a parallel version I believe that I might be missing the whole point, as this is my first contact with parallel implementations.
My questions are:
1. How should I implement a data-parallelism version for this algorithm ?
2. Why is my implementation slightly slower than the sequential one?
The code is available on: https://gist.github.com/drenge/6395472 (each file is a different version parallel/sequential)
there are two ways to make a parallel version of an algorithm/program.
You study the algorithm and write the serial program. Afterwards, you start profiling the program to see where you can obtain speed gains. Those are the places where parallelism might come in handy (might, not will). I call this method the "desparate man's tool". This method is useful (!), but most of the times, the method beneath can provide better performance gains. This way of doing the optimisation method only takes programming and user experience into account.
You take the algorithm and try to figure out an other algorithm that permits parallel handling of the problem. Are there independent calculations or steps in the algorithm, are there parts of the algorithm that can be done before other parts completely finish, ... This could be called "the theoretical approach". Keep in mind that every thread has its overhead, and you don't want the overhead to be bigger than the gain you wish to obtain.
In fact, a combination of both is the best way to go (if parallelism is really necessary): first concentrate on method 2 (optimise the algorithm so that is stays scientifically correct, but can be treated in multi threading). Then look at the critical thread (can be found while profiling) and start optimising that thread.
As Kerrek SB already told: parallel programming is a very complex topic, with lots of possible pitfalls. And at the end of the road, you should ask yourself: is it worth the effort. After all: loosing weeks of study and programming time to gain some minutes is not worth your while.
On the other hand, if your program will run thousands of times, frustrating users due to long waiting times or a lack of responsiveness, than maybe, it could be useful to make a more performant version after all. But again: can't you reach the same goal by optimising a sequential version without the parallel clutter? Lot's of algorithms are of order O(exp(x)) or worse and can be reduced to O(x) or even O(log(x)).
Kind regards,
PB

How can I implement cooperative lightweight threading with C on Mac OS X?

I'm trying to find a lightweight cooperative threading solution to try implementing an actor model.
As far as I know, the only solution is setcontext/getcontext,
but the functionality is deprecated(?) by Apple. I'm confused by why they did this; however, I'm finding replacement for this.
Pthreads are not an option because I need cooperative model instead of preemptive model to control context switching timing precisely/manually without expensive locking.
-- edit --
Reason of avoiding pthreads:
Because pthreads are not cooperative/deterministic and too expensive. I need actor model for game logic code, so thousand of execution context are required at minimal. Hardware threading requires MB of memory and expense to create/destruct. And parallelism is not important. In fact, I just need concurrent execution of many functions. This can be implemented with many divided functions and some kind of object model, but my goal is reducing those overheads.
If I know something wrong, please correct me. It'll be very appreciated.
The obvious 'lightweight' solution is to avoid complex nested calling except for limited situations where the execution time will be tightly bounded, then store an explicit state structure for each "thread" and implement the main program logic as a state machine that's easily suspendable/resumable at most points. Then you can simply swap out the pointer to the state structure for 'context switch'. Basically this technique amounts to keeping all of your important state variables, including what would conventionally be local variables, in the state structure.
Whether this is worthwhile probably depends on your reason for avoiding pthreads. If your reason is to be portable to non-POSIX systems, or if you really need deterministic program flow, then it may be worthwhile. But if you're just worried about performance overhead and memory synchronization issues, I think you should use pthreads and manage these issues. If you avoid unnecessary locking, use fine-grained locks, and minimize the amount of time locks are held, performance should not suffer.
Edit: Based on your further details posted in the comments on the main question, I think the solution I've proposed is the right one. Each actor should have their own context in which you store the state of the actor's action/thinking/etc. You would have a run_actor function which would take an actor context and a number of "ticks" to advance the actor's state by, and a run_all_actors function which would iterate over a list of active actors and call run_actor for each with the specified number of ticks.
Further, note that this solution still allows you to use real threads to take advantage of SMP/multicore machines. You simply divide the actors up between threads. You may need some degree of locking if one actor needs to examine another's context (e.g. for collision detection).
I was researching this question as well, and I ran across GNU Pth (not to be confused with Pthreads). See http://www.gnu.org/software/pth/
It aims to be a portable solution for cooperative threads. It does mention it is implemented via setcontext/getcontext if available (so it may not be on Mac OSX). Otherwise it says it uses longjmp/setjmp, but it's not clear to me how that works.
Hope this is helpful to anyone who searches for this question.
I have discovered the some of required functionalities from setcontext/getcontext are implemented in libunwind.
Unfortunately the library won't be compiled on Mac OS X because of deprecation of the setcontext/getcontext. Anyway Apple has implemented their own libunwind which is compatible with GNU's implementation at source level. The library is exist on Mac OS X 10.6, 10.7, and iOS. (I don't know exact version in case of iOS)
This library is not documented, but I could find the headers from these locations.
/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS5.0.sdk/usr/include/libunwind.h
/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator4.3.sdk/usr/include/libunwind.h
/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator5.0.sdk/usr/include/libunwind.h
/Developer/SDKs/MacOSX10.6.sdk/usr/include/libunwind.h
/Developer/SDKs/MacOSX10.7.sdk/usr/include/libunwind.h
There was a note in the header file that to go GNU libunwind site for documentation.
I'll bet on the library.

Higher level languages with C functions

Maybe this has been asked before, but I couldn't find it. My question is simple: Does it make sense to write an application in higher level languages (Java, C#, Python) and time/performance-critical functions in C? Or at this point unless you do very low level OS/game/sensor programming it is all the same to have a full, say, Java application?
It makes sense if you a) notice a performance issue, AND b) use performance measurements to locate where the problem occurs, AND c) can't achieve the desired performance by modifying the existing code.
If any of these items don't apply, then it's probably premature optimization.
If you are fluent and productive in a higher level language such a Python and Lua, then by all means start writing in that language. Look for bottlenecks if and when they exist.
speed can be quite similar with things like C#.
What is tricky is latency. So if you want to write something which you know takes < 10ms then C is reasonably predictable (ignoring whatever variability your operating system might introduce).
Having said that for very tight long loops (image processing for example), things like C/C++ can offer some speed up. You can get quite reasonable performance out of C#, you do have to be careful how you program it though, but I have found in general, you can still squeeze more out of C/C++
Usually your preferred language will do whatever you need it to in acceptable time (er, blazing fast).
Sure, critical time/performance functions can be written in a "more optimal/suitable" language like C or assembly - but whether it will actually make things faster is another story. There are laws that govern how much actual/overall speed-up that you'll get, specifically Amdahs Law and (diminishing returns) .
To answer your question, it only makes sense to rewrite these critical functions in lower languages if there is good enough speed-up to warrant the extra work.
I suggest you read Cliff Click's excellent Java vs. C Performance....Again.. It outlines many points of comparison between Java and C++.
Predictably, the conclusion is that it depends, but it's a worthwhile read.
You can only really answer this on a case by case basis and without reference to what you are doing it's impossible to answer.
But maybe what you actually want here is some kind of sanity check to ensure that this approach isn't crazy to consider. I have worked on tools ranging from a very large graphics application (~ million lines) to relatively small physical simulation engines (~10000 lines) there were written just as you describe: Python on the outside for interface (both for API and for GUI), C/C++ on the inside for the heavy lifting. They all benefited from this division of responsibility.
This is done especially with scripting languages. Things that come to mind are games made in Python. Most of the time Python is too slow for some of the more number crunching aspects of games and they make this a C module for speed. Be sure that you actually need the speed though and that number crunching is your performance bottleneck and not a general algorithm issue. Doing a brute-force search over a list is going to be slow in both C and Python.
I'd say that it depends a lot on your application.
What sort of performance is important? short startup-time? high throughput? low latency? Is it important that response time is always predictable?
Is the application short lived or does it run for long periods of time?
Java can give you high throughput, but occasional short freezes while doing garbage collection. C# is probably similar.
Python, well the performance there will often lag behind the others for anything not written in C (some things ARE written in C, even if you didn't do it yourself).
So as others said. It depends.
But as always with performance: Measure first, optimize when you know you need to.

Halting in non-Turing-complete languages

The halting problem cannot be solved for Turing complete languages and it can be solved trivially for some non-TC languages like regexes where it always halts.
I was wondering if there are any languages that has both the ability to halt and not halt but admits an algorithm that can determine whether it halts.
The halting problem does not act on languages. Rather, it acts on machines
(i.e., programs): it asks whether a given program halts on a given input.
Perhaps you meant to ask whether it can be solved for other models of
computation (like regular expressions, which you mention, but also like
push-down automata).
Halting can, in general, be detected in models with finite resources (like
regular expressions or, equivalently, finite automata, which have a fixed
number of states and no external storage). This is easily accomplished by
enumerating all possible configurations and checking whether the machine enters
the same configuration twice (indicating an infinite loop); with finite
resources, we can put an upper bound on the amount of time before we must see
a repeated configuration if the machine does not halt.
Usually, models with infinite resources (unbounded TMs and PDAs, for instance),
cannot be halt-checked, but it would be best to investigate the models and
their open problems individually.
(Sorry for all the Wikipedia links, but it actually is a very good resource for
this kind of question.)
Yes. One important class of this kind are primitive recursive functions. This class includes all of the basic things you expect to be able to do with numbers (addition, multiplication, etc.), as well as some complex classes like #adrian has mentioned (regular expressions/finite automata, context-free grammars/pushdown automata). There do, however, exist functions that are not primitive recursive, such as the Ackermann function.
It's actually pretty easy to understand primitive recursive functions. They're the functions that you could get in a programming language that had no true recursion (so a function f cannot call itself, whether directly or by calling another function g that then calls f, etc.) and has no while-loops, instead having bounded for-loops. A bounded for-loop is one like "for i from 1 to r" where r is a variable that has already been computed earlier in the program; also, i cannot be modified within the for-loop. The point of such a programming language is that every program halts.
Most programs we write are actually primitive recursive (I mean, can be translated into such a language).
The short answer is yes, and such languages can even be extremely useful.
There was a discussion about it a few months ago on LtU:
http://lambda-the-ultimate.org/node/2846

How to implement continuations?

I'm working on a Scheme interpreter written in C. Currently it uses the C runtime stack as its own stack, which is presenting a minor problem with implementing continuations. My current solution is manual copying of the C stack to the heap then copying it back when needed. Aside from not being standard C, this solution is hardly ideal.
What is the simplest way to implement continuations for Scheme in C?
A good summary is available in Implementation Strategies for First-Class Continuations, an article by Clinger, Hartheimer, and Ost. I recommend looking at Chez Scheme's implementation in particular.
Stack copying isn't that complex and there are a number of well-understood techniques available to improve performance. Using heap-allocated frames is also fairly simple, but you make a tradeoff of creating overhead for "normal" situation where you aren't using explicit continuations.
If you convert input code to continuation passing style (CPS) then you can get away with eliminating the stack altogether. However, while CPS is elegant it adds another processing step in the front end and requires additional optimization to overcome certain performance implications.
I remember reading an article that may be of help to you: Cheney on the M.T.A. :-)
Some implementations of Scheme I know of, such as SISC, allocate their call frames on the heap.
#ollie: You don't need to do the hoisting if all your call frames are on the heap. There's a tradeoff in performance, of course: the time to hoist, versus the overhead required to allocate all frames on the heap. Maybe it should be a tunable runtime parameter in the interpreter. :-P
If you are starting from scratch, you really should look in to Continuation Passing Style (CPS) transformation.
Good sources include "LISP in small pieces" and Marc Feeley's Scheme in 90 minutes presentation.
It seems Dybvig's thesis is unmentioned so far.
It is a delight to read. The heap based model
is the easiest to implement, but the stack based
is more efficient. Ignore the string based model.
R. Kent Dybvig. "Three Implementation Models for Scheme".
http://www.cs.indiana.edu/~dyb/papers/3imp.pdf
Also check out the implementation papers on ReadScheme.org.
https://web.archive.org/http://library.readscheme.org/page8.html
The abstract is as follows:
This dissertation presents three implementation models for the Scheme
Programming Language. The first is a heap-based model used in some
form in most Scheme implementations to date; the second is a new
stack-based model that is considerably more efficient than the
heap-based model at executing most programs; and the third is a new
string-based model intended for use in a multiple-processor
implementation of Scheme.
The heap-based model allocates several important data structures in a
heap, including actual parameter lists, binding environments, and call
frames.
The stack-based model allocates these same structures on a stack
whenever possible. This results in less heap allocation, fewer memory
references, shorter instruction sequences, less garbage collection,
and more efficient use of memory.
The string-based model allocates versions of these structures right in
the program text, which is represented as a string of symbols. In the
string-based model, Scheme programs are translated into an FFP
language designed specifically to support Scheme. Programs in this
language are directly executed by the FFP machine, a
multiple-processor string-reduction computer.
The stack-based model is of immediate practical benefit; it is the
model used by the author's Chez Scheme system, a high-performance
implementation of Scheme. The string-based model will be useful for
providing Scheme as a high-level alternative to FFP on the FFP machine
once the machine is realized.
Besides the nice answers you've got so far, I recommend Andrew Appel's Compiling with Continuations. It's very well written and while not dealing directly with C, it is a source of really nice ideas for compiler writers.
The Chicken Wiki also has pages that you'll find very interesting, such as internal structure and compilation process (where CPS is explained with an actual example of compilation).
Examples that you can look at are: Chicken (a Scheme implementation, written in C that support continuations); Paul Graham's On Lisp - where he creates a CPS transformer to implement a subset of continuations in Common Lisp; and Weblocks - a continuation based web framework, which also implements a limited form of continuations in Common Lisp.
Continuations aren't the problem: you can implement those with regular higher-order functions using CPS. The issue with naive stack allocation is that tail calls are never optimised, which means you can't be scheme.
The best current approach to mapping scheme's spaghetti stack onto the stack is using trampolines: essentially extra infrastructure to handle non-C-like calls and exits from procedures. See Trampolined Style (ps).
There's some code illustrating both of these ideas.
The traditional way is to use setjmp and longjmp, though there are caveats.
Here's a reasonably good explanation
Continuations basically consist of the saved state of the stack and CPU registers at the point of context switches. At the very least you don't have to copy the entire stack to the heap when switching, you could only redirect the stack pointer.
Continuations are trivially implemented using fibers. http://en.wikipedia.org/wiki/Fiber_%28computer_science%29
. The only things that need careful encapsulation are parameter passing and return values.
In Windows fibers are done using the CreateFiber/SwitchToFiber family of calls.
in Posix-compliant systems it can be done with makecontext/swapcontext.
boost::coroutine has a working implementation of coroutines for C++ that can serve as a reference point for implementation.
As soegaard pointed out, the main reference remains R. Kent Dybvig. "Three Implementation Models for Scheme".
The idea is, a continuation is a closure that keeps its evaluation control stack. The control stack is required in order to continue the evalution from the moment the continuation was created using call/cc.
Oftenly invoking the continuation makes long time of execution and fills the memory with duplicated stacks. I wrote this stupid code to prove that, in mit-scheme it makes the scheme crash,
The code sums the first 1000 numbers 1+2+3+...+1000.
(call-with-current-continuation
(lambda (break)
((lambda (s) (s s 1000 break))
(lambda (s n cc)
(if (= 0 n)
(cc 0)
(+ n
;; non-tail-recursive,
;; the stack grows at each recursive call
(call-with-current-continuation
(lambda (__)
(s s (- n 1) __)))))))))
If you switch from 1000 to 100 000 the code will spend 2 seconds, and if you grow the input number it will crash.
Use an explicit stack instead.
Patrick is correct, the only way you can really do this is to use an explicit stack in your interpreter, and hoist the appropriate segment of stack into the heap when you need to convert to a continuation.
This is basically the same as what is needed to support closures in languages that support them (closures and continuations being somewhat related).

Resources