I have a little problem with threads in Erlang NIFs. You can view my code here: http://pastebin.com/HMCj24Jp. The problem is that when I start the thread, it takes some arguments and starts the generate_binary function. That part is okay, but when I try to read the arguments, everything crashes.
It's perhaps not the most complex problem, but I could not find any documentation about this, so I hope some of you might know the answer.
Your generate_buffer() NIF creates a thread to call generate_binary(), but the calling NIF doesn't wait for the newly created thread to finish. The thread just gets created and is likely still running by the time the NIF returns, though this will be nondeterministic, as threads are in general. You're probably crashing the Erlang BEAM emulator because generate_binary() is off trying to call into the Erlang run-time system after generate_buffer() has returned, confusing the poor thing horribly.
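For illustration, here is a minimal sketch of what waiting on the thread could look like using the erl_nif thread API. The names generate_binary and thread_arg stand in for whatever your pastebin code uses, and note that terms from env/argv are only valid during the NIF call, so anything the thread needs must be copied out first (e.g., into an environment from enif_alloc_env()):

    #include "erl_nif.h"

    /* Hypothetical stand-in for your thread function. */
    static void *generate_binary(void *arg);

    static ERL_NIF_TERM
    generate_buffer(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
    {
        ErlNifTid tid;
        void *thread_arg = NULL;  /* ... package copies of argv here ... */
        void *result;

        if (enif_thread_create("generator", &tid, generate_binary,
                               thread_arg, NULL) != 0)
            return enif_make_badarg(env);

        /* Wait for the thread before returning, so it cannot outlive
           the NIF call and touch resources that are no longer valid. */
        enif_thread_join(tid, &result);

        /* ... convert result into an Erlang term ... */
        return enif_make_atom(env, "ok");
    }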
Now, even assuming you fix this to make it do what you wanted, I don't think you should be using explicit native threads here at all.
First, Erlang NIFs are supposed to look like regular Erlang functions, differing only in that they happen to be written in a different language. Erlang functions don't spawn separate threads of execution, then return, leaving that thread running. Excepting those that deal with I/O and persistent data storage, Erlang functions are deterministic and referentially transparent. Your NIF is neither. So, even if it worked, it's still "wrong" in the sense that it violates an experienced Erlang programmer's expectations.
Second, if you need multiprocessing, Erlang already provides the idea of processes. If your NIF will really do so much work that it can benefit from multiprocessing, why not rework your NIF so it can work on a subrange of the data, then call it multiple times, once each from a number of Erlang processes? Then you don't need explicit native threads; the BEAM emulator will create the optimal number of threads for you, transparently.
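As a rough sketch of that reworking (the function name and the XOR "work" are just placeholders), the NIF could take an offset and length and touch only that slice, with each calling Erlang process owning one slice:

    /* Hypothetical NIF: process bytes [offset, offset+len) of a binary.
       Each Erlang process calls this on its own subrange, so the VM's
       schedulers provide the parallelism -- no native threads needed. */
    static ERL_NIF_TERM
    generate_buffer_range(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
    {
        ErlNifBinary in;
        unsigned long offset, len;

        if (argc != 3 ||
            !enif_inspect_binary(env, argv[0], &in) ||
            !enif_get_ulong(env, argv[1], &offset) ||
            !enif_get_ulong(env, argv[2], &len) ||
            offset + len > in.size)
            return enif_make_badarg(env);

        ERL_NIF_TERM out_term;
        unsigned char *out = enif_make_new_binary(env, len, &out_term);

        for (unsigned long i = 0; i < len; i++)
            out[i] = in.data[offset + i] ^ 0xFF;  /* placeholder work */

        return out_term;
    }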
Third, thread creation overhead is going to kill performance if the lifetime of the thread only extends over the course of a single Erlang NIF call, as it seems you actually intended. This is another reason Erlang processes will be more efficient here.
As the title suggests, is there a way in C to detect when a user-level thread running on top of a kernel-level thread (e.g., a pthread) has blocked (or is about to block) for I/O?
My use case is as follows: I need to execute tasks in a multithreaded environment (on top of kernel threads, e.g., pthreads). The tasks are basically user functions that can be synchronized and may use blocking operations within. I need to hide latency in my implementation. So, I am exploring the idea of implementing the tasks as user-level threads for better control of their execution context, such that when a task blocks or synchronizes, I context-switch to other ready tasks (i.e., implementing my own scheduler for the user-level threads). Consequently, nearly the full OS time quantum per kernel thread can be used.
There used to be code that did this, for example GNU pth. It's generally been abandoned because it just doesn't work very well and we have much better options now. You have two choices:
1) If you have OS help, you can use the OS mechanisms. Windows provides OS help for this; IOCP dispatching uses it.
2) If you have no OS help, then you have to convert all blocking operations into non-blocking ones that call your dispatcher rather than blocking. So, for example, if someone calls socket, you intercept that call and set the socket non-blocking. When they call read, you intercept that call and if they get a "would block" indication, you arrange to resume when the operation might succeed and schedule another thread.
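To make option 2 concrete, here is a minimal sketch of an intercepted read wrapper; sched_wait_readable() is a hypothetical hook into your dispatcher, not a real API:

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical dispatcher hook: park the current user-level thread
       until fd is readable, and run another ready thread meanwhile. */
    void sched_wait_readable(int fd);

    /* Wrapper the user code calls instead of read(2). The intercepted
       socket() wrapper is assumed to have set O_NONBLOCK already, e.g.
       fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK); */
    ssize_t my_read(int fd, void *buf, size_t count)
    {
        for (;;) {
            ssize_t n = read(fd, buf, count);
            if (n >= 0)
                return n;
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;             /* real error: report it */
            sched_wait_readable(fd);   /* switch to another ready thread;
                                          resume when fd may be readable */
        }
    }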
You can look at GNU pth to see how you might make option 2 work. But be warned, GNU pth is full of reported bugs that have never been fixed since it was abandoned. It will give you an idea of how to implement things like mutexes and sleeps in a cooperative user-space threading environment. But don't actually use the code.
Generally, when a process writes to a file, e.g., a Python script running open('file', 'w').write('text'), what are the exact events that occur? By that I mean something along the lines of "process A loads the file from the hard disk into RAM, process B changes the content, then ...". I've read about IPC and now I'm trying to dig deeper and understand more about processes. I couldn't find a thorough explanation of the subject, so if you could point one out or explain it yourself, I'd really appreciate it.
The example of "a python script running open('file', 'w').write('text')" is heavily OS-dependent. The only processes involved here are the process running the Python interpreter (which, e.g. on Linux, sometimes executes in user space and sometimes in kernel space) and possibly some kernel-only processes, with any IPC, if required, happening inside the kernel. There is no particular requirement that everything down to the disk I/O itself cannot be handled by the user's process while it is running in kernel mode, but in practice other processes may be involved. This is OS- and even driver-specific behavior.
In this particular example (which isn't great, because it relies on CPython automatically closing the file when the object goes out of scope), the Python process makes one system call to open the file, one to write it, and one to close it. These are all blocking -- that is, they do not return until the results are ready. When the process blocks, it is put on a queue, waiting for some event to occur that makes it ready to run again.
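For reference, the Python one-liner boils down to roughly these three blocking system calls in C (ignoring Python's user-space buffering, which may defer the actual write(2) until the close):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Each of these calls may put the process to sleep
           until the kernel has completed the operation. */
        int fd = open("file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        write(fd, "text", 4);
        close(fd);
        return 0;
    }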
The opposite of this is asynchronous I/O, which can be performed by polling, by callbacks, or by the select system call, which can block until any one of a number of events has occurred.
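A minimal select sketch, assuming fd1 and fd2 are already-open descriptors (sockets, pipes, ...):

    #include <sys/select.h>
    #include <unistd.h>

    /* Block until fd1 or fd2 is readable, then service whichever is ready. */
    void wait_and_read(int fd1, int fd2)
    {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd1, &readfds);
        FD_SET(fd2, &readfds);

        int maxfd = fd1 > fd2 ? fd1 : fd2;
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) > 0) {
            char buf[4096];
            if (FD_ISSET(fd1, &readfds))
                read(fd1, buf, sizeof buf);
            if (FD_ISSET(fd2, &readfds))
                read(fd2, buf, sizeof buf);
        }
    }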
But when most people talk about IPC, they are not usually talking about communication between or with kernel processes. Rather, they are talking about communication between multiple user processes and/or threads, using semaphores, mutexes, named pipes, etc. A good introduction to these sorts of things would be any tutorial information you can find on using pthreads, or even the Python threads and multiprocessing modules. There are examples there for several simple cases.
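As a taste of what those tutorials cover, here is a minimal pthread example with two threads incrementing a shared counter under a mutex (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);   /* only one thread at a time */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 */
        return 0;
    }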
The primary difference between processes and threads on Linux is that threads share an address space and processes each have their own address space. Python itself adds the wrinkle of the GIL, which limits the utility of threads in Python somewhat.
I've been looking into how I could embed languages (let's use Lua as an example) in Erlang. This of course isn't a new idea, and there are many libraries out there that can do this. However, I was wondering if it was possible to start a GenServer with state that is modified by Lua. This means that once you start the GenServer, it will start a (long-running) Lua process to manipulate the GenServer's state. I know this is possible as well, but I was wondering if I could spawn 1,000, 10,000, or even 100,000 of these processes.
I'm not really familiar with this topic but I have done some research.
(Please correct me if I'm wrong on any of these options).
TL;DR: skip to the last paragraph.
First option: NIFs:
This doesn't seem like an option, since it will block the Erlang scheduler running the current process. If I want to spawn a large number of these, it will freeze the entire runtime.
Second option: Port Driver:
It's like a NIF, but it communicates by sending data to a specified port, which can also send data back to Erlang. This is nice, although this also seems to block the scheduler. I've tried a library which does the boilerplate for you as well, but that seemed to block the scheduler after spawning 10 processes. I've also looked into the postgresql example in the Erlang documentation, which is said to be async, but I couldn't get the example code to work (R13?). Is it even possible to run that many Port Driver processes without blocking the runtime?
Third option: C Nodes:
I thought this was very interesting and wanted to try it out, but apparently the project "erlang-lua" already does this. It's nice because it won't crash your Erlang VM if something goes wrong and the processes are isolated. But in order to actually spawn a single process you need to spawn an entire node. I have no idea how expensive this is. Nor am I sure what the limit is for connecting nodes in a cluster, but I don't see myself spawning 100,000 C nodes.
Fourth option: Ports:
At first I thought this was the same as a Port Driver, but it's actually different. You spawn a process which executes an application and communicates through STDIN and STDOUT. This would work well for spawning a large number of processes, and (I think?) they aren't a threat to the Erlang VM. But if I'm going to communicate through STDIN/STDOUT, why even bother with an embeddable language to begin with? I might as well use any other scripting language.
And so, after much research in a field I'm not familiar with, I've come to this. You could use a GenServer as an "entity" whose AI is written in Lua, which is why I'd like to have a process for each entity. My question is: how do I achieve spawning many GenServers which communicate with long-running Lua processes? Is this even possible? Should I be tackling my problem differently?
If you can make the Lua code — or more accurately, its underlying native code — cooperate with the Erlang VM, you have a few choices.
Consider one of the most important functions of the Erlang VM: managing the execution of a (potentially large number of) Erlang's lightweight processes across a relatively small set of scheduler threads. It uses several techniques to know when a process has used up its timeslice or is waiting and so should be scheduled out to give another process a chance to run.
You seem to be asking how you can get native code to run however it likes within the VM, but as you've already hinted, the reason native code can cause problems for the VM is that it has no practical way to stop the native code from completely taking over a scheduler thread and thus preventing regular Erlang processes from executing. Because of this, native code has to cooperatively yield the scheduler thread back to the VM.
For older NIFs, the choices for such cooperation were:
Keep the amount of time NIF calls run on a scheduler thread to 1ms or less.
Create one or more private threads. Transition each long-running NIF call from its scheduler thread over to a private thread for execution, then return the scheduler thread to the VM.
The problems here are that not all calls can complete in 1ms or less, and that managing private threads can be error-prone. To get around the first problem, some developers would break the work down into chunks and use an Erlang function as a wrapper to manage a series of short NIF calls, each of which completed one chunk of work. As for the second problem, well, sometimes you just can't avoid it, despite its inherent difficulty.
NIFs running on Erlang 17.3 or later can also cooperatively yield the scheduler thread using the enif_schedule_nif function. To use this feature, the native code has to be able to do its work in chunks such that each chunk can complete within the usual 1ms NIF execution window, similar to the approach mentioned earlier but without the need to artificially return to an Erlang wrapper. My bitwise example code provides many details about this.
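A minimal sketch of the pattern (the NIF name, chunk size, and the "work" itself are placeholders):

    #include "erl_nif.h"

    #define CHUNK 4096

    /* Hypothetical chunked NIF: processes up to CHUNK units per call,
       then reschedules itself with an updated argument until done. */
    static ERL_NIF_TERM
    do_work(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
    {
        unsigned long remaining;
        if (!enif_get_ulong(env, argv[0], &remaining))
            return enif_make_badarg(env);

        unsigned long n = remaining < CHUNK ? remaining : CHUNK;
        /* ... do n units of work, staying well under the 1ms budget ... */
        remaining -= n;

        if (remaining == 0)
            return enif_make_atom(env, "done");

        /* Yield: the VM re-invokes do_work later with the new argument. */
        ERL_NIF_TERM newargv[1] = { enif_make_ulong(env, remaining) };
        return enif_schedule_nif(env, "do_work", 0, do_work, 1, newargv);
    }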
Erlang 17 also brought an experimental feature, off by default, called dirty schedulers. This is a set of VM schedulers that do not have the same native code execution time constraints as the regular schedulers; work there can block for essentially infinite periods without disrupting normal VM operation.
Dirty schedulers come in two flavors: CPU schedulers for CPU-bound work, and I/O schedulers for I/O-bound work. In a VM compiled to enable dirty schedulers, there are by default as many dirty CPU schedulers as there are regular schedulers, and there are 10 I/O schedulers. These numbers can be altered using command-line switches, but note that to try to prevent regular scheduler starvation, you can never have more dirty CPU schedulers than regular schedulers. Applications use the same enif_schedule_nif function mentioned earlier to execute NIFs on dirty schedulers. My bitwise example code provides many details about this too. Dirty schedulers will remain an experimental feature for Erlang 18 as well.
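Assuming the do_work NIF from the sketch above, dispatching it to a dirty CPU scheduler is just a matter of passing a different flag:

    /* In do_work, yield onto a dirty CPU scheduler instead of a regular
       one; the rescheduled call may then run as long as it needs. */
    return enif_schedule_nif(env, "do_work", ERL_NIF_DIRTY_JOB_CPU_BOUND,
                             do_work, 1, newargv);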
Native code in linked-in port drivers is subject to the same on-scheduler execution time constraints as NIFs, but drivers have two features NIFs don't:
Driver code can register file descriptors into the VM polling subsystem and be notified when any of those file descriptors becomes I/O-ready.
The driver API supports access to a non-scheduler async thread pool, the size of which is configurable but by default has 10 threads.
The first feature allows native driver code to avoid blocking a thread for I/O. For example, instead of performing a blocking recv call, driver code can register the socket file descriptor so the VM can poll it and call the driver back when the file descriptor becomes readable.
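A rough sketch of that pattern (my_driver_data is an assumed driver-state struct, not part of the API):

    #include "erl_driver.h"

    /* Assumed driver state -- in a real driver this comes from start(). */
    typedef struct { ErlDrvPort port; } my_driver_data;

    /* Instead of a blocking recv(), hand the socket to the VM poll set;
       the VM calls our ready_input() back when it becomes readable. */
    static void register_socket(my_driver_data *d, int sock_fd)
    {
        driver_select(d->port, (ErlDrvEvent)(long)sock_fd, ERL_DRV_READ, 1);
    }

    /* Named in the driver's ErlDrvEntry.ready_input slot. */
    static void ready_input(ErlDrvData handle, ErlDrvEvent event)
    {
        /* recv() on (int)(long)event will not block here:
           the fd was just reported readable. */
    }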
The second feature provides a separate thread pool useful for driver tasks that can't conform to the scheduler thread native code execution time constraints. You can achieve the same in a NIF but you have to set up your own thread pool and write your own native code to manage and access it. But regardless of whether you use the driver async thread pool, your own NIF thread pool, or dirty schedulers, note that they are all regular operating system threads, and so trying to start a huge number of them simply isn't practical.
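And a sketch of the async-pool feature, reusing the assumed my_driver_data struct from above (do_task and free_task are placeholders):

    /* Push a long-running task onto the driver async thread pool
       instead of blocking a scheduler thread. */
    static void do_task(void *async_data)   { /* runs on an async-pool thread */ }
    static void free_task(void *async_data) { driver_free(async_data); }

    static void start_task(my_driver_data *d, void *task_data)
    {
        driver_async(d->port, NULL, do_task, task_data, free_task);
        /* When do_task finishes, the VM calls the driver's
           ErlDrvEntry.ready_async callback on a scheduler thread. */
    }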
Native driver code does not yet have dirty scheduler access, but this work is ongoing and it might become available as an experimental feature in an 18.x release.
If your Lua code can make use of one or more of these features to cooperate with the Erlang VM, then what you're attempting may be possible.
I'm working on a project that requires downloading about 300 pictures from different locations using wget every 20 minutes.
I wrote a C program that reads the database for all the Ids and locations into an array.
For each entry in the array, I call the external wget command to download it.
It works, but it is slow because it downloads them one at a time.
My thinking is to use multiple processes, multiple threads, or OpenMP to create several children.
Any suggestion for how to do this is appreciated.
Multiple Processes
An error in one process cannot crash another process. This is particularly useful when you will host third-party code (e.g. plugins), and this is the approach that (among others) Google Chrome takes. The disadvantage is that N processes use more system resources than N threads.
Multiple Threads
Uses fewer system resources than an equivalent number of processes. Thread programming is more error prone for many developers, and an error in one thread can affect other threads.
Best Option
For what you are doing, you are unlikely to see a significant difference in resource utilization. Use whichever model you can implement quickly and to a high standard of quality.
Personally, I would go for multiple processes. The wgets do not need to share any memory or communicate (other than an exit status, which is only needed by the parent), so threads would not provide any additional benefit (in my opinion). In addition, creating them as processes lets the OS scheduler decide when best to run each one.
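A minimal sketch of the fork/exec approach (URLs taken from the command line for simplicity; a real version would cap the number of concurrent children rather than forking all 300 at once):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Fork one child per URL, each exec'ing wget; the parent reaps them. */
    int main(int argc, char *argv[])
    {
        for (int i = 1; i < argc; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                execlp("wget", "wget", "-q", argv[i], (char *)NULL);
                _exit(127);            /* exec failed */
            }
        }
        int status;
        while (wait(&status) > 0)
            ;                          /* collect exit statuses */
        return 0;
    }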
I understand the differences between a multithreaded program and a program relying on inter-machine communication. My problem is that I have a nice multithreaded program written in 'C' that works and runs really well on an 8-core machine. There is now opportunity to port this program to a cluster to gain access to more cores. Is it worth the effort to rip out the pthread stuff and retrofit MPI (which I've never used) or are we better off recoding the whole thing (or most of it) from scratch? Assume we are "stuck" with C so a wholesale change of language isn't an option.
Depending on how your software is written, there may or may not be advantages to going to MPI over keeping your pthread implementation.
Unfortunately (or fortunately), message passing is a very different beast than pthreading - the basic assumption is quite different. I love this quote from Joshua Phillips of the Maestro team: "The difference between message-passing and shared-state communication is equivalent to the difference between sending a colleague an e-mail requesting her to complete a task and opening up her organizer to write down the task directly in her to-do list. More than just being rude, the latter is likely to confuse her – she might erase it, not notice it, or accidentally prioritize it incorrectly."
Unfortunately, the way you share data is very different. There is no direct access to data in other threads (since it can be on other machines), so it can be a very daunting task to migrate from pthreads to MPI. On the other hand, if the code is written so each thread is isolated, it can be an easy task, and definitely worthwhile.
In order to determine how useful this will be, you'll need to understand the code, and what you hope to achieve by switching. It can be worthwhile as a learning experience (you learn a LOT about synchronization and threading by working in MPI), but may not be practical if the gains will be minor.
Re. your comment to Reed -- this sounds like an easy, low-overhead conversion to MPI. Just be careful: not all MPI APIs support dynamic creation of processes; i.e., you start your program with N processes (specified at startup) and you're stuck with N processes throughout the lifetime of the program.
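For anyone who hasn't seen MPI, here is the minimal C skeleton; N is fixed when you launch with something like mpirun -np N:

    #include <mpi.h>
    #include <stdio.h>

    /* Each of the N processes runs this same program and learns its
       rank, then works on its own slice of the data. */
    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* e.g. rank r handles elements [r*chunk, (r+1)*chunk) */
        printf("rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }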