Say that I have N threads accessing an array with N elements. The array has been prepared before the threads start. Each thread will access a different element (the thread I will access element I, both for reading and writing).
In theory, I'd expect such an access pattern not to cause any race conditions, but will Ruby actually guarantee thread safety in this case?
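For concreteness, here is a minimal sketch of the pattern I mean (the names are just illustrative):

N = 8
arr = Array.new(N, 0)            # prepared before the threads start
threads = (0...N).map do |i|
  Thread.new { arr[i] += 1 }     # thread i reads and writes only element i
end
threads.each(&:join)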
"but will Ruby actually guarantee thread safety in this case?"
Ruby does not have a defined memory model, so there are no guarantees of any kind.
YARV has a Giant VM Lock which prevents multiple Ruby threads from running at the same time, which gives some implicit guarantees, but this is a private, internal implementation detail of YARV. For example, TruffleRuby, JRuby, and Rubinius can run multiple Ruby threads in parallel.
Since there is no specification of what the behavior should be, every Ruby implementation is free to do whatever it wants. Most commonly, Ruby implementors try to mimic the behavior of YARV, but even that is not well-defined. In YARV, data structures are generally not thread-safe, so if you want to mimic YARV, should you make your data structures not thread-safe? But in YARV multiple threads also cannot run at the same time, so many operations end up implicitly thread-safe; so if you want to mimic YARV, should you make your data structures thread-safe after all?
Or, in order to mimic YARV, should you prevent multiple threads from running at the same time? But being able to run multiple threads in parallel is one of the main reasons people choose, for example, JRuby over YARV.
As you can see, this is very much not a trivial question.
The best solution is to verify the behavior of each Ruby implementation separately. Actually, that is only the second-best solution.
The best solution is to use something like the concurrent-ruby gem, where someone else has already done the work of verifying the behavior of each Ruby implementation for you. The concurrent-ruby maintainers have a close relationship with several Ruby implementations (Chris Seaton, one of the two lead maintainers of concurrent-ruby, is also the lead developer of TruffleRuby, a JRuby core developer, and a member of ruby-core, for example), so you can generally be certain that everything in concurrent-ruby is safe on all supported Ruby implementations (currently YARV, JRuby, and TruffleRuby).
Concurrent Ruby has a Concurrent::Array class which is thread-safe. You can see how it is implemented here: https://github.com/ruby-concurrency/concurrent-ruby/blob/master/lib/concurrent-ruby/concurrent/array.rb
As you can see, on YARV, Concurrent::Array is actually the same as ::Array, but for other implementations, more work is required.
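For illustration, a minimal sketch of using it for the access pattern from the question (assuming the concurrent-ruby gem is installed; require 'concurrent' is its documented entry point):

require 'concurrent'                 # gem: concurrent-ruby

N = 8
arr = Concurrent::Array.new(N, 0)    # thread-safe on YARV, JRuby, TruffleRuby
threads = (0...N).map do |i|
  Thread.new { arr[i] = i * i }      # thread i writes only element i
end
threads.each(&:join)
p arr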
The concurrent-ruby developers are also working on specifying Ruby's memory model, so that in the future programmers know what to expect (and what not to expect), and implementors know what they are (and aren't) allowed to optimize.
Alternatives to Mutable Arrays
In standard Ruby implementations, an Array is not thread-safe. However, a Queue is. On the other hand, a Queue is not quite an Array, so you don't have all the methods on Queue that you may be looking for.
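For example, a minimal sketch (names are mine) that collects results through a Queue instead of indexing into a shared Array:

q = Queue.new                       # thread-safe FIFO, built into Ruby
threads = (0...8).map do |i|
  Thread.new { q << [i, i * i] }    # push (index, value) pairs instead of indexing
end
threads.each(&:join)
results = Array.new(8)
until q.empty?
  i, v = q.pop
  results[i] = v
end
p results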
The Concurrent Ruby gem provides a thread-safe Array class, but as a rule thread-safe classes will be slower than those that aren't. Depending on your data this may not matter, but it's certainly a design consideration.
If you know from the beginning that you're going to be heavily reliant on threading, you should build your application on a Ruby implementation that offers real parallel threading to begin with (e.g. consider JRuby or TruffleRuby), and design your application around concurrency models that treat data as immutable rather than sharing mutable objects between threads (on CRuby, Ractors take this approach).
Immutable data is a better pattern for threading than shared objects. You may or may not have problems with any given mutable object given enough due care, but Ractors and fiber-local variables should be faster and safer than trying to make mutable objects thread-safe. YMMV, though.
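As a minimal illustration of that model (CRuby 3.x Ractors; a sketch, not a recommendation for any particular design), each ractor gets its own copy of the input and returns a value, so nothing mutable is shared:

ractors = (0...4).map do |i|
  Ractor.new(i) { |n| n * n }   # i is copied in; no shared mutable state
end
p ractors.map(&:take)           # => [0, 1, 4, 9]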
I'm working with Xilinx ISE on a Spartan-6, which is driving a complex board with multiple functions. As you can imagine the VHDL project is becoming pretty complex and as a C++ programmer I feel the need of using arrays to compact the code.
I've already tried using them in the past but I had many problems. At the time I wasn't very experienced, and a lot of other errors were present, which I solved after ditching the array structures.
Another problem I encountered was that I couldn't simulate the post-translate model (again with arrays), but I later discovered that that simulation is buggy anyway because it doesn't initialize the generated LUTs.
So here are the questions: what precautions do I have to keep in mind when using arrays? What are the most important design practices with arrays? Will I have problems in simulating sub-modules with the post-map or the post-PAR simulation?
The only complication that I can see with using an array vs using a "flat" signal by name is related to optimization that happens during the build process and how the names of things can be modified so that they are not easily recognizable. As you may have noticed, this is particularly true for multi-dimensional arrays and it is also true with records. Other than that, if you use a flat structure, or an array, they should implement the same, assuming you are describing the same structure in two different ways.
It may be difficult to determine the "new name" for your array due to the optimizations/renaming that took place, but this should not be confused with different design implementation on the hardware level.
When using arrays the syntax can be particularly troublesome when dealing with initialization. Be sure to consult your synthesis tool reference manual or user guide for supported syntax and constructs.
In my opinion, the "cost" of using an array (or record) in terms of additional complication in post-synthesis activities is greatly outweighed by the simplification and clarity that can be gained in the code.
Arrays and records, and combinations of these, are one of the great strengths of VHDL, and make it possible to create functional, readable, scalable, and maintainable code. These types have always been in VHDL, and I have not encountered any problems in tools due to these types. VHDL tools are generally so mature that it is in more subtle features that you may encounter bugs.
If you encounter interface changes in the post-map/PAR netlist, then I think it should be addressed by a wrapper around the design in order to fit into an existing test bench.
Since the majority of simulation verification is (or should be) done at the VHDL level (post-map/PAR simulation is only a sanity check to complement STA), it is much more important to have a high-level, abstract view of the design during simulation and debugging than to design at the bit level just to make the design match at post-map/PAR simulation.
I use arrays for two things.
First, to make code concise and simple. For example:
signal cnt1 : unsigned(10 downto 0);
signal cnt2 : unsigned(10 downto 0);
...
signal cnt9 : unsigned(10 downto 0);
Can be replaced by:
type cnt_arr_t is array(1 to 9) of unsigned(10 downto 0);
signal cnt : cnt_arr_t;
I never had any problems in Xilinx using arrays like this; they behave the same as defining multiple signals, they can be multi-dimensional, you can pass them to functions, have them on entity ports, etc. The XST user guide specifies that multi-dimensional arrays can't be used as entity ports, but Molten Zilmer uses them without issues.
The second, more common use of arrays is to define memories (RAM, ROM, dual-port RAM, etc.). In that case, you have to be careful how you use the array or the design won't map to memory resources. A 512x32 memory uses 1 block RAM or lots of LUTs/registers; block RAMs are smaller, faster, and less power-hungry at large depths.
Look at the XST synthesis guide for how to use arrays to define memories. As a general rule, your array indices should start at 0 (1 to 2048 is not equivalent to 0 to 2047). Also, multi-dimensional arrays did not map to memory resources, at least when I tried with an older version of XST. Otherwise, arrays of std_logic_vector/unsigned and arrays of records are fine.
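For illustration, here is a sketch in the usual XST inference style (the entity and type names are mine) of a 512x32 single-port RAM with a synchronous read, which should map to a block RAM:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram_512x32 is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(8 downto 0);               -- indices 0 to 511
    din  : in  std_logic_vector(31 downto 0);
    dout : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of ram_512x32 is
  type ram_t is array (0 to 511) of std_logic_vector(31 downto 0);
  signal ram : ram_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(addr)) <= din;
      end if;
      dout <= ram(to_integer(addr));               -- synchronous read => block RAM
    end if;
  end process;
end architecture;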
You should always look at the synthesis report to make sure XST understands your code as you intend. Every memory mapping is reported with its mode, depth, width and more. It's the best way to spot misinterpretation errors.
I must admit I have never had to do post-map/PAR simulations, so I can't tell you whether you will have problems.
Do I understand the new standard correctly that shared_ptr is not required to use a reference count? Only that it is likely to be implemented this way?
I could imagine an implementation that uses a hidden linked list somehow. In N3291, "20.7.2.2.5 (8) shared_ptr observers [util.smartptr.shared.obs]", the note says
[ Note: use_count() is not necessarily efficient. — end note ]
which gave me that idea.
You're right, nothing in the spec requires the use of an explicit "counter", and other possibilities exist.
For example, a linked-list implementation was suggested for the implementation of boost's shared_ptr; however, the proposal was ultimately rejected because it introduced costs in other areas (size, copy operations, and thread safety).
Abstract description
Some people say that shared_ptr is a "reference counter smart pointer". I don't think it is the right way to look at it.
Actually shared_ptr is all about (non-exclusive) ownership: all the shared_ptr that are copies of a shared_ptr initialised with a pointer p are owners.
shared_ptr keeps track of the set of owners, to guarantee that:
while the set of owners is non-empty, delete p is not called;
when the set of owners becomes empty, delete p (or a copy of D, the destruction functor) is called immediately.
Of course, to determine when the set of owners becomes empty, shared_ptr only needs a counter. The abstract description is just slightly easier to think about.
Possible implementation techniques
To keep track of the number of owners, a counter is not only the most obvious approach; it's also relatively obvious how to make it thread-safe using atomic compare-and-modify operations.
To keep track of all the owners, a linked list of owners is not only the obvious solution, it's also an easy way to avoid allocating any memory for each set of owners. The problem is that it isn't easy to make such an approach efficiently thread-safe (anything can be made thread-safe with a global lock, but that is against the very idea of parallelism).
In the case of a multi-threaded implementation
On the one hand, we have a small, fixed-size (unless a custom destruction function is used) memory allocation that's very easy to optimize, plus simple atomic integer operations.
On the other hand, there is costly and complicated linked-list handling; and if a per-owner-set mutex is needed (as I think it is), the cost of memory allocation is back, at which point we might as well replace the mutex with the counter!
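For what it's worth, here is a minimal sketch of the counter approach (hypothetical, not any particular standard library's implementation): every copy shares one heap-allocated control block holding an atomic owner count.

#include <atomic>
#include <utility>

template <typename T>
class shared_ptr_sketch {
    struct control_block { std::atomic<long> owners{1}; };
    T* ptr_ = nullptr;
    control_block* ctrl_ = nullptr;

    void release() {
        // The last owner leaving the set triggers destruction.
        if (ctrl_ && ctrl_->owners.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete ptr_;
            delete ctrl_;
        }
    }

public:
    shared_ptr_sketch() = default;
    explicit shared_ptr_sketch(T* p)
        : ptr_(p), ctrl_(p ? new control_block : nullptr) {}
    shared_ptr_sketch(const shared_ptr_sketch& o) : ptr_(o.ptr_), ctrl_(o.ctrl_) {
        if (ctrl_) ctrl_->owners.fetch_add(1, std::memory_order_relaxed);
    }
    shared_ptr_sketch& operator=(shared_ptr_sketch o) {   // copy-and-swap
        std::swap(ptr_, o.ptr_);
        std::swap(ctrl_, o.ctrl_);
        return *this;
    }
    ~shared_ptr_sketch() { release(); }
    long use_count() const { return ctrl_ ? ctrl_->owners.load() : 0; }
    T* get() const { return ptr_; }
};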
About multiple possible implementations
How many times have I read that many implementations are possible for a "standard" class?
Who has never heard the fantasy that the complex class could be implemented with polar coordinates? This is idiotic, as we all know: complex must use Cartesian coordinates. If polar coordinates are preferred, another class must be created; there is no way a polar complex class could be used as a drop-in replacement for the usual complex class.
Same for a (non-standard) string class: there is no reason for a string class to be internally NUL terminated and not store the length as an integer, just for the fun and inefficiency of repeatedly calling strlen.
We now know that designing std::string to tolerate COW was a bad idea; it is the reason for the unusual invalidation semantics of const iterators.
std::vector is now guaranteed to be contiguous.
The end of the fantasy
At some point, the fantasy where standard classes have many significantly different reasonable implementations has to be dropped. Standard classes are primitive building blocks; not only should they be very efficient, they should have predictable efficiency.
A programmer should be able to make portable assumptions about the relative speed of basic operations. A complex class is useless for serious number crunching if even the simplest addition turns into a bunch of transcendental computations. If a string class is not guaranteed to have very fast copies via data sharing, the programmer will have to minimize string copies.
An implementer is free to choose a different implementation technique only when it doesn't make a common cheap operation extremely costly (by comparison).
For many classes, this means that there is exactly one viable implementation strategy, with sometimes a few degrees of freedom (like the size of a block in a std::deque).
The documentation of Data.Array reads:
Haskell provides indexable arrays, which may be thought of as functions whose domains are isomorphic to contiguous subsets of the integers. Functions restricted in this way can be implemented efficiently; in particular, a programmer may reasonably expect rapid access to the components.
I wonder how fast can (!) and (//) be. Can I expect O(1) complexity from these, as I would have from their imperative counterparts?
In general, yes, you should be able to expect O(1) from !, although I'm not sure that's guaranteed by the standard.
You might want to look at the vector package if you want faster arrays, though (thanks to stream fusion). It is also better designed.
Note that // is probably O(n), though, because it has to build a whole new array while traversing the list of updates. If you need a lot of mutation you can use MArray or MVector.
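A small sketch contrasting the two (assuming the array and vector packages):

import Data.Array (listArray, (!), (//))
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import Control.Monad (forM_)

main :: IO ()
main = do
  let a = listArray (0, 9) [0 .. 9 :: Int]
  print (a ! 3)                 -- indexing: effectively O(1)
  print ((a // [(3, 42)]) ! 3)  -- (//) builds a whole new array: O(n)
  v <- MV.replicate 10 (0 :: Int)
  forM_ [0 .. 9] $ \i -> MV.write v i (i * i)  -- O(1) in-place writes
  frozen <- V.freeze v
  print (frozen V.! 3)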
I am implementing a call graph program for C using a Perl script. I wonder how to resolve call graphs for function pointers using the output of 'objdump'?
How do different call graph applications resolve function pointers?
Are function pointers resolved at run time, or can they be resolved statically?
EDIT
How do call graphs resolve cycles in static evaluation of a program?
It is easy to build a call graph of A-calls-B when the call statement explicitly mentions B. It is much harder to handle indirect calls, as you've noticed.
Good static analysis tools form estimates of the contents of pointer variables by propagating pointer assignments/copies/arithmetic across program data flows (inter- and intra-procedural ["global"]), using a variety of schemes that are often conservative ("you get too much").
Without such an estimate, you cannot have any idea what a pointer contains and therefore simply cannot make a useful prediction (well, you can use the ultimate conservative estimate that it will go anywhere, but I think you've already rejected that solution).
Our DMS Software Reengineering Toolkit has static control-flow/dataflow/points-to/call-graph analysis that has been applied to huge systems (~25 million lines) of C code and produced such call graphs. The machinery to do this is pretty complex, but you can find it in advanced topics in the compiler literature. I doubt you want to implement this in Perl.
This is easier when you have source code, because you at least reliably know what is code and what is not. You're trying to do this on object code, which means you can't even reliably separate code from data.
Using function pointers is a way of choosing the actual function to call at runtime, so in general it isn't possible to know statically what will actually happen.
However, you could look at all the functions that could possibly be called and perhaps show those in some way. Often the callbacks have a unique enough signature (though not always).
If you want to do better, you have to analyze the source code, to see which functions are assigned to pointers to begin with.
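To illustrate with a toy example of my own (not the output of any particular tool): a static analyzer can usually only narrow the callee set of an indirect call to the address-taken functions with a matching signature.

#include <stdio.h>

static int add(int a, int b) { return a + b; }
static int sub(int a, int b) { return a - b; }

int main(void) {
    int (*op)(int, int) = add;     /* assignment an analyzer can track */
    if (getchar() == '-')
        op = sub;                  /* runtime-dependent reassignment */
    printf("%d\n", op(5, 3));      /* indirect call: callee set = {add, sub} */
    return 0;
}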
I'm trying to find a lightweight cooperative threading solution to try implementing an actor model.
As far as I know, the only solution is setcontext/getcontext,
but that functionality has been deprecated(?) by Apple. I'm confused about why they did this; in any case, I'm looking for a replacement.
Pthreads are not an option because I need a cooperative model instead of a preemptive one, in order to control context-switch timing precisely/manually without expensive locking.
-- edit --
Reasons for avoiding pthreads:
Pthreads are not cooperative/deterministic and are too expensive. I need the actor model for game logic code, so thousands of execution contexts are required at a minimum. Native threads require megabytes of memory each and are expensive to create/destroy, and parallelism is not important; in fact, I just need concurrent execution of many functions. This could be implemented with many small functions and some kind of object model, but my goal is reducing those overheads.
If I've got something wrong, please correct me. It would be much appreciated.
The obvious 'lightweight' solution is to avoid complex nested calling (except in limited situations where the execution time is tightly bounded), store an explicit state structure for each "thread", and implement the main program logic as a state machine that's easily suspendable/resumable at most points. Then you can 'context switch' simply by swapping out the pointer to the state structure. Basically, this technique amounts to keeping all of your important state variables, including what would conventionally be local variables, in the state structure.
Whether this is worthwhile probably depends on your reason for avoiding pthreads. If your reason is to be portable to non-POSIX systems, or if you really need deterministic program flow, then it may be worthwhile. But if you're just worried about performance overhead and memory synchronization issues, I think you should use pthreads and manage these issues. If you avoid unnecessary locking, use fine-grained locks, and minimize the amount of time locks are held, performance should not suffer.
Edit: Based on your further details posted in the comments on the main question, I think the solution I've proposed is the right one. Each actor should have their own context in which you store the state of the actor's action/thinking/etc. You would have a run_actor function which would take an actor context and a number of "ticks" to advance the actor's state by, and a run_all_actors function which would iterate over a list of active actors and call run_actor for each with the specified number of ticks.
Further, note that this solution still allows you to use real threads to take advantage of SMP/multicore machines. You simply divide the actors up between threads. You may need some degree of locking if one actor needs to examine another's context (e.g. for collision detection).
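A rough sketch of what I mean, with hypothetical names (actor_t, run_actor, and run_all_actors are mine):

#include <stddef.h>

typedef struct actor {
    int state;        /* which step of its logic the actor is in       */
    int counter;      /* example of a would-be local variable, hoisted */
} actor_t;

/* Advance one actor by `ticks` steps of its state machine. */
static void run_actor(actor_t *a, int ticks) {
    while (ticks-- > 0) {
        switch (a->state) {
        case 0:                            /* "think" step             */
            a->counter = 0;
            a->state = 1;
            break;
        case 1:                            /* "act" step               */
            a->counter++;
            if (a->counter == 10)
                a->state = 0;              /* go back to thinking      */
            break;
        }
    }
}

/* Cooperative "scheduler": no locks, no kernel threads. */
static void run_all_actors(actor_t *actors, size_t n, int ticks) {
    for (size_t i = 0; i < n; i++)
        run_actor(&actors[i], ticks);
}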
I was researching this question as well, and I ran across GNU Pth (not to be confused with Pthreads). See http://www.gnu.org/software/pth/
It aims to be a portable solution for cooperative threading. It does mention that it is implemented via setcontext/getcontext where available (so that may not work on Mac OS X); otherwise it says it uses longjmp/setjmp, but it's not clear to me how that works.
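For reference, a minimal sketch of two cooperative Pth threads yielding to each other (API as described in the Pth manual; I haven't tested this on Mac OS X):

#include <stdio.h>
#include <pth.h>

static void *worker(void *arg) {
    for (int i = 0; i < 3; i++) {
        printf("%s: %d\n", (const char *)arg, i);
        pth_yield(NULL);              /* explicit cooperative switch point */
    }
    return NULL;
}

int main(void) {
    pth_init();
    pth_t a = pth_spawn(PTH_ATTR_DEFAULT, worker, "A");
    pth_t b = pth_spawn(PTH_ATTR_DEFAULT, worker, "B");
    pth_join(a, NULL);
    pth_join(b, NULL);
    pth_kill();
    return 0;
}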
Hope this is helpful to anyone who searches for this question.
I have discovered that some of the required functionality from setcontext/getcontext is implemented in libunwind.
Unfortunately, the library won't compile on Mac OS X because of the deprecation of setcontext/getcontext. However, Apple has implemented its own libunwind, which is compatible with GNU's implementation at the source level. The library exists on Mac OS X 10.6, 10.7, and iOS. (I don't know the exact version in the case of iOS.)
This library is not documented, but I could find the headers in these locations:
/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS5.0.sdk/usr/include/libunwind.h
/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator4.3.sdk/usr/include/libunwind.h
/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator5.0.sdk/usr/include/libunwind.h
/Developer/SDKs/MacOSX10.6.sdk/usr/include/libunwind.h
/Developer/SDKs/MacOSX10.7.sdk/usr/include/libunwind.h
There is a note in the header file pointing to the GNU libunwind site for documentation.
I'll bet on this library.