MapReduce or a batch job? - batch-file

I have a function that needs to be called on a lot of files (1000's). Each file is independent of the others, so they can be processed in parallel. The output of the function for each file does not (currently) need to be combined with the others. I have a lot of servers I can scale this on, but I'm not sure what to do:
1) Run a MapReduce on it
2) Create 1000's of jobs (each has a different file it works on).
Would one solution be preferable to another?
Thanks!

MapReduce provides significant value when distributing workloads over large datasets. Your case is a set of small, independent jobs on small, independent data files, so in my opinion it could be overkill.
So I would prefer to run a bunch of dynamically created batch files.
Or, alternatively, use a cluster manager and job scheduler, like SLURM https://computing.llnl.gov/linux/slurm/
SLURM: A Highly Scalable Resource Manager
SLURM is an open-source resource manager designed for Linux clusters
of all sizes. It provides three key functions. First it allocates
exclusive and/or non-exclusive access to resources (computer nodes) to
users for some duration of time so they can perform work. Second, it
provides a framework for starting, executing, and monitoring work
(typically a parallel job) on a set of allocated nodes. Finally, it
arbitrates contention for resources by managing a queue of pending
work.

Since it is only 1000's of files (and not 1000000000's of files), a full-blown Hadoop setup is probably overkill. GNU Parallel tries to fill the gap between sequential scripts and Hadoop:
ls files | parallel -S server1,server2 your_processing {} '>' out{}
You will probably want to learn about --sshloginfile. Depending on where the files are stored, you may want to learn about --trc, too.
Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ

Use a job array in SLURM. There is no need to submit 1000s of jobs, just one: the array job.
This will kick off the same program on as many nodes/cores as are available with the resources you specify. Eventually it will churn through them all. Your only issue is how to map the array index to a file to process. The simplest way is to prepare a text file listing all the paths, one per line. Each element of the job array then reads the ith line of this file and uses it as the path of the file to process.
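A minimal sketch of the index-to-file mapping described above, in Python. It assumes the array is submitted with 1-based indices (for example `sbatch --array=1-1000`) and that the paths are listed in a file called `file_list.txt` here, which is a made-up name. SLURM exports the index of each array element in the `SLURM_ARRAY_TASK_ID` environment variable.

```python
import os


def path_for_task(list_file, task_id):
    """Return line number `task_id` (1-based) of `list_file`, without the newline."""
    with open(list_file) as fh:
        for lineno, line in enumerate(fh, start=1):
            if lineno == task_id:
                return line.rstrip("\n")
    raise IndexError(f"{list_file} has no line {task_id}")


def current_task_path(list_file="file_list.txt", env=os.environ):
    """Look up the path for this array element from SLURM_ARRAY_TASK_ID."""
    return path_for_task(list_file, int(env["SLURM_ARRAY_TASK_ID"]))
```

Each array element would then call your processing function on `current_task_path()`.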

Related

Data/event exchange between jobs

Is it possible in Apache Flink to create an application that consists of multiple jobs which together form a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic stage and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this, and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and can point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, where the upstream job is generating data faster than the downstream job can consume it.
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, JavaScript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: for the streaming APIs, I'd only say that it's brilliant and over the top.
Batch APIs: processing graphs are very powerful and are optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, and optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that use input on HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big-data partitioned storage on HDFS and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in plain M/R or Spark, is always planned with the idiom of 'prefer local processing', so that data is processed by the same node that holds the data blocks, which is faster and avoids data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel.
But w/o the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach for local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled to a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). For input splits of files in HDFS, the JobManager assigns the input splits with locality preference. So there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.
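To make the local-first behaviour concrete, here is a toy model in Python. It is not Flink's actual code, and the names (`next_split`, `simulate`, the node and split labels) are made up for illustration: source tasks on fixed hosts request splits one at a time, and the assigner hands out a split stored on the requesting host if one is left, otherwise any remaining split.

```python
def next_split(host, unassigned):
    """Return (split_id, is_local): prefer a split stored on `host`, else take any."""
    for i, (split_id, hosts) in enumerate(unassigned):
        if host in hosts:
            unassigned.pop(i)
            return split_id, True
    split_id, _ = unassigned.pop(0)
    return split_id, False


def simulate(request_order, splits):
    """Serve split requests in the given host order; count local and remote reads."""
    unassigned = list(splits)
    local = remote = 0
    for host in request_order:
        if not unassigned:
            break
        _, is_local = next_split(host, unassigned)
        if is_local:
            local += 1
        else:
            remote += 1
    return local, remote


# Three splits, each stored on a different node (hypothetical layout).
SPLITS = [("s1", {"node1"}), ("s2", {"node2"}), ("s3", {"node3"})]
```

With one source task per node, `simulate(["node1", "node2", "node3"], SPLITS)` reads every split locally. With only two tasks, on node1 and node2, one split has to be read remotely. If a single fast task on node1 makes all three requests before the others get a turn, only one read is local, which is the race described above.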

MPI how to send and receive SQLite database

I have a big SQLite database to process, so I would like to use MPI for parallelization to speed things up. What I want to do is send the database from the root to every slave, and send the modified databases back to the root after each slave adds some tables to it. I want to use MPI_Type_create_struct to create a datatype to store the database, but the database is too complicated. Is there any other way to handle this situation? Thank you in advance!
I recently dealt with a similar problem - I have a large MPI application that uses SQLite as a configuration store. Handling multi-process writes is a challenge with an embedded SQL database. My experience with this involves a massively parallel application (running up to 65,535 ranks) with a shared filesystem.
Based on the FAQ from SQLite and some experience with database engines, there are a few ways to approach this problem. I am making the assumption that you are operating with a shared distributed file system, and multiple separate computers (a standard HPC cluster setup).
Since SQLite will block when multiple processes write to the database (but not read), reads will most likely not be an issue. Each process can run multiple SELECT commands at the same time without issue.
The challenge will be in the writing. Disk I/O is several orders of magnitude slower than computation, so generally this will be the bottleneck. Having said that, network communication may also be a significant slowdown, so how you approach the problem really depends on where the weakest link of your running environment will be.
If you have a fast network and slow disk speed, or if you want to implement this in the most straightforward way possible, your best bet is to have a single MPI rank in charge of writing to the database. Your compute processes would independently run SELECT commands until computation was complete, then send the new data to the MPI database process. The database control process would then write the new data to disk. I would not try to send the structure of the database across the network; rather, I would send the data that should be written, along with (possibly) a flag that identifies which table/insert query the data should be written with. This technique is somewhat similar to how an RDBMS works: while RDBMS servers do support concurrent writes, there is a "central" process in control of the ordering of write operations.
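A rough sketch of the writer side of that idea, in Python with the standard `sqlite3` module. The MPI transport is deliberately left out: assume the writer rank has already received a batch of `(tag, row)` messages from the workers. The table names, column names and tags are hypothetical.

```python
import sqlite3

# Hypothetical mapping from a message tag to the INSERT statement it selects.
INSERT_BY_TAG = {
    "result": "INSERT INTO results (sample_id, score) VALUES (?, ?)",
    "log": "INSERT INTO logs (worker_rank, message) VALUES (?, ?)",
}


def make_db(path=":memory:"):
    """Open the database and create the (hypothetical) tables if needed."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (sample_id INTEGER, score REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS logs (worker_rank INTEGER, message TEXT)")
    return conn


def write_messages(conn, messages):
    """Apply (tag, row) messages in arrival order inside one transaction."""
    with conn:  # commits on success, rolls back if an insert fails
        for tag, row in messages:
            conn.execute(INSERT_BY_TAG[tag], row)
```

Only plain rows and a small tag cross the network, so no custom MPI datatype for the database itself is needed, and only one process ever opens the file for writing.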
One thing to note is that if a process writes to the SQLite database, the file is locked for all processes that are trying to read or write to it. You will need to either handle the SQLITE_BUSY return code in your worker processes, register a callback to handle this, change the busy behavior, or use an alternate technique. In my application, I found that loading the database as an in-memory database (https://www.sqlite.org/inmemorydb.html) for the readers provided a good workaround. Readers access the in-memory database, but send results to the controlling process for writes. The downside is that you will have multiple copies of the database in memory.
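Both workarounds can be sketched with Python's standard `sqlite3` module (an illustration only, since the application described above is not necessarily written in Python). The `timeout` argument of `sqlite3.connect` makes a connection wait on a locked database instead of failing immediately, and `Connection.backup` copies an on-disk database into a private in-memory one. The function names are made up.

```python
import sqlite3


def open_with_busy_timeout(path, seconds=30.0):
    """Open a connection that waits up to `seconds` on a locked database."""
    return sqlite3.connect(path, timeout=seconds)


def load_into_memory(path):
    """Copy the on-disk database into a private in-memory database."""
    disk = sqlite3.connect(path)
    memory = sqlite3.connect(":memory:")
    try:
        disk.backup(memory)  # online backup API, available in Python 3.7+
    finally:
        disk.close()
    return memory
```

Changes made to the in-memory copy never reach the file, which is why results still have to be sent to the writing process.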
Another option that might be less network intensive is to do the reads concurrently and have each worker process write out to its own file. You could write out to separate SQLite database files, or even export something like CSV (depending on the complexity of the data). When the writes are complete, you would then have a single process merge the individual files into a single result database file (see the question "How can I merge many SQLite databases?"). This method has its own issues, but depending on where your bottlenecks are and how the system as a whole is laid out, this technique may work.
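One way to do the merge step is SQLite's `ATTACH DATABASE`, sketched below with Python's `sqlite3` module. This assumes every worker file contains a table with the same name and column layout, and that the target database already has that table; the function name and the default table name `results` are made up for the example.

```python
import sqlite3


def merge_databases(target_path, worker_paths, table="results"):
    """Append the rows of `table` from each worker database to the target."""
    target = sqlite3.connect(target_path)
    try:
        for path in worker_paths:
            target.execute("ATTACH DATABASE ? AS src", (path,))
            # `table` is interpolated into the SQL, so it must be a trusted name.
            target.execute(f"INSERT INTO {table} SELECT * FROM src.{table}")
            target.commit()  # DETACH is not allowed inside an open transaction
            target.execute("DETACH DATABASE src")
    finally:
        target.close()
```

Rows are copied as-is, so if the tables use explicit primary keys that can collide across workers, the `INSERT ... SELECT` would need to list the columns and leave the key out.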
Finally, you might consider reading from the SQLite database and saving the data to a proper distributed file format, such as HDF5 (or using MPI IO). Once the computation is done, it would be pretty straightforward to write a script that would create a new SQLite database from this foreign file format.

What is scratch space/filesystem in HPC?

I am studying HPC applications and parallel filesystems. I came across the terms scratch space and scratch filesystem.
I cannot visualize where this scratch space exists. Is it on the compute node as a mounted filesystem /scratch, or on the main storage space?
What are its contents?
Is scratch space independent on each compute node, or can two or more nodes share a single scratch space?
So let's say I have a file 123.txt which I want to process in parallel. Will the scratch space contain parts of this file, or will the whole file be copied?
I am confused and cannot find a clear description anywhere on Google. Please point me to some.
Thanks a lot.
It all depends on how the cluster was set up and what the users need. When you are given access to a cluster you should also be given some information about how it is meant to be used, which should answer most of your questions.
On one of the clusters I work with, NFS is used for long-term storage and some Lustre space is available for job scratch space. Both the NFS and Lustre file systems are seen by all of the nodes. Each of the nodes also has some scratch space on the node that only that node can see.
If you want your job to work on 123.txt in parallel, you can copy 123.txt to a shared scratch space (Lustre), or you can copy it to each of your node scratch spaces in your job file.
for node in $(sort -u "$PBS_NODEFILE"); do scp 123.txt "$node:/scratch"; done
Once each node has a copy, you can run your job. Once the job is done, you need to copy your results to persistent storage, since clusters will often run scripts to clean up scratch space.
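The stage-in, compute, stage-out cycle described above can be sketched in Python as follows. This is only an illustration: the directory arguments stand in for whatever scratch and persistent paths your cluster provides, and `process` is a hypothetical callable that writes a new output file on scratch and returns its path.

```python
import shutil
from pathlib import Path


def run_on_scratch(input_path, scratch_dir, results_dir, process):
    """Stage a file to scratch, process it there, and save the output."""
    scratch_dir, results_dir = Path(scratch_dir), Path(results_dir)
    staged = Path(shutil.copy(input_path, scratch_dir))  # stage in
    output = Path(process(staged))                       # work on scratch
    results_dir.mkdir(parents=True, exist_ok=True)
    saved = Path(shutil.copy(output, results_dir))       # stage out
    staged.unlink()                                      # tidy up scratch
    output.unlink()
    return saved
```

Copying the result out before the job ends matters more than the tidy-up, because the site's purge scripts may remove anything left on scratch.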
There are a lot of different ways to think about or deploy scratch space or a scratch file system.
Let's say you have a cluster of Linux nodes, and these nodes all have a hard disk. You could imagine a /scratch space, local to each node. Since the OS image is going to be relatively small, and one cannot procure anything smaller than a terabyte drive nowadays, you end up with close to a terabyte of storage for the node to use.
What would you do with this node-local storage? Oh, lots of things. Scalable Checkpoint-Restart. Local out-of-core operations.
When I first started playing with clusters, it seemed like a good idea to gang all this un-used space into a parallel file system. PVFS worked really well for that purpose.
This lets me segue to a /scratch parallel file system available to all nodes. There is a technology component to this (which parallel file system will a site deploy?), but there is also a policy component: how long will data on this file system be retained? Is it backed up? /scratch often implies that files are not backed up and are in fact purged after some period of not being accessed (typically two weeks).

Using a temporary database as an intermediate store in a pipeline?

I have a bioinformatics analysis program that is composed of 5 different steps. Each step is essentially a Perl script that takes in input, does magic, and outputs several text files. Each step needs to be completely finished before the next starts. The entire process takes 24 hours or so on Core i7 computers.
One major problem is that each step produces about 5-10 gigabytes of intermediate output text files needed by subsequent steps, and there's a bunch of redundancy. For example, the output of step 1 is used by steps 2, 3 and 4, and each one does the same preprocessing on it. This structure grew 'organically' because each step was developed independently. Doing everything in memory unfortunately will not work for us, since data that is 10 gigs on disk is way too big to fit into memory once loaded into a Perl hash/array.
It would be nice if the data could be loaded into an intermediate database, processed once in a step, and be available in all subsequent steps. The data is essentially relational/tabular. Some of the steps only need sequential access to the data, while others need random access to files.
Does anyone have any experience in this sort of thing?
Which database would be right for such a task? I have used and liked SQLite, but does it scale to 20GB+ sizes? Can you tell PostgreSQL or MySQL to heavily cache data in memory? (I figure that databases written in C/C++ would be much more memory-efficient than Perl hashes/arrays, so most of it could be cached in memory on a 24GB machine.) Or is there a better, non-RDBMS solution, given the overhead of creating, indexing, and subsequently destroying 20GB+ in an RDBMS for single-run analyses?
Have you looked at some of the NoSQL databases? They seem suited to your kind of work. I have used MongoDB for a high-throughput application.
Here is a comparison of various NoSQL databases.

Resources