What is scratch space / scratch filesystem in HPC?

I am studying HPC applications and parallel filesystems, and I came across the terms scratch space and scratch filesystem.
I cannot visualize where this scratch space exists. Is it a filesystem mounted on the compute node as /scratch, or is it on the main storage?
What are its contents?
Is the scratch space independent on each compute node, or can two or more nodes share a single scratch space?
Let's say I have a file 123.txt that I want to process in parallel. Will the scratch space contain parts of this file, or will the whole file be copied?
I am confused, and nowhere on Google is there a clear description. Please point me to one.
Thanks a lot.

It all depends on how the cluster was set up and what the users need. When you are given access to a cluster you should also be given some information about how it is meant to be used, which should answer most of your questions.
On one of the clusters I work with, NFS is used for long-term storage and some Lustre space is available for job scratch space. Both the NFS and Lustre filesystems are visible to all of the nodes. Each node also has some scratch space on the node itself that only that node can see.
If you want your job to work on 123.txt in parallel, you can copy 123.txt to a shared scratch space (Lustre), or you can copy it to each node's local scratch space in your job file:
for i in $(sort -u "$PBS_NODEFILE"); do scp 123.txt "$i":/scratch/ ; done   # one copy per unique node in the PBS node file
Once each node has a copy you can run your job. Once the job is done, copy your results back to persistent storage, since clusters will often run scripts to clean up scratch space.
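A minimal sketch of that final copy-back step, assuming the same PBS environment as the loop above; the result filename and destination directory are placeholders:
for i in $(sort -u "$PBS_NODEFILE"); do
    # pull each node's output off its local /scratch before the purge scripts run
    scp "$i:/scratch/result_123.out" "$HOME/job_output/result_123.$i.out"
done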

There are a lot of different ways to think about or deploy scratch space or a scratch file system.
Let's say you have a cluster of Linux nodes, and these nodes all have a hard disk. You could imagine a /scratch space local to each node. Since the OS image is going to be relatively small, and you cannot buy anything smaller than a terabyte drive nowadays, you end up with close to a terabyte of storage for the node to use.
What would you do with this node-local storage? Oh, lots of things. Scalable Checkpoint-Restart. Local out-of-core operations.
When I first started playing with clusters, it seemed like a good idea to gang all this unused space into a parallel file system. PVFS worked really well for that purpose.
Which lets me segue to a /scratch parallel file system available to all nodes. There is a technology component to this (which parallel file system will a site deploy?) but there is also a policy component: how long will data on this file system be retained? Is it backed up? /scratch often implies that files are not backed up and are in fact purged after some period of not being accessed (typically two weeks).
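As an illustration of that purge policy, a cleanup job on such a system might look roughly like this; the path and the 14-day window are assumptions, not any particular site's rules:
find /scratch -type f -atime +14 -print -delete   # remove files not read in the last two weeks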


rclone slow transfer from bucket to filesystem

I'm using rclone to transfer data between a MinIO bucket and a shared storage. I'm migrating a store, and the amount of data is around 200 GB of product pictures. Every picture has its own folder/path, so a lot of folders need to be created as well. rclone is installed on the new server, and the storage is attached to the server via SAN. The transfer has been running for over a week and we are at 170 GB right now. Everything works, but in my opinion it is really slow. Is it normal for a transfer out of a bucket into a classic filesystem to be this slow?
(Doing the math, the speed is only about 2.3 Mbps. I honestly would not pay anything for that speed.)
Perhaps you should break the issue down and diagnose it part by part. Below are several common places to look for slow transfers (generally speaking, for any file transfer):
First of all, networks and file systems usually do not perform well with lots of small files, so to isolate the issue, upload a bigger file (1 GB+) to MinIO first, and for each step below test with the big file first.
Is the source fast enough? Try copying the files from MinIO to local storage or a ramdisk (/tmp is usually tmpfs and therefore stored in RAM; use mount to check).
Is the destination fast enough? Try dd or another disk performance testing utility.
Is the network latency to the source high? Try pinging or curling the API (with timing).
Is the destination storage slow under load? Check it with iostat while a copy is running (the destination is SAN-attached, so disk wait times there matter more than ping).
Maybe the CPU is the bottleneck, since encoding and decoding take quite a lot of computing power. Try top while a copy is running.
Again, try these steps with the big file and the many small files separately. There is a good chance that the small files are the issue; if so, look at rclone's concurrency options.
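A rough sketch of those checks as shell commands, assuming a hypothetical MinIO endpoint and a SAN-backed destination mounted at /destination; adjust names for your setup:
# raw write speed of the destination (1 GiB, bypassing the page cache)
dd if=/dev/zero of=/destination/ddtest bs=1M count=1024 oflag=direct
# latency and total time to fetch one large object from the source
curl -o /dev/null -s -w 'connect: %{time_connect}s  total: %{time_total}s\n' https://minio.example.com/bucket/bigfile.bin
# disk utilisation on the SAN-attached destination, refreshed every 2 seconds
iostat -x 2
# CPU usage of rclone while a copy runs
top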
I had the same problem copying hundreds of thousands of small files from an S3-compatible storage to a local storage. Originally I was using s3fs+rsync. Very (very) slow, and it was getting stuck on the largest folders. Then I discovered rclone, and finished the migration within a few hours with these parameters:
rclone copy source:/bucket /destination/folder --checkers 256 --transfers 256 --fast-list --size-only --progress
Explanation of the options (from https://rclone.org/flags/)
--checkers 256 Number of checkers to run in parallel (default 8)
--transfers 256 Number of file transfers to run in parallel (default 4)
--fast-list Use recursive list if available; uses more memory but fewer transactions
--size-only Skip based on size only, not mod-time or checksum (wouldn't apply in your case if copying to an empty destination)
--progress Show progress during transfer

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: for the streaming APIs, I'd only say that it's brilliant and over the top.
Batch APIs: the processing graphs are very powerful, and they are optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that use input on HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big, partitioned data on HDFS and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in plain M/R or Spark, is always planned with the 'prefer local processing' idiom, so that data is processed by the node that holds the data blocks, which is faster and avoids data transfer over the network.
In our tests on a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled the 3 input splits and scheduled them in parallel.
But without the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach for local input split processing than Hadoop and Spark. Hadoop creates for each input split a Map task which is preferably scheduled to a node that hosts the data referred by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). In the case of input splits for files in HDFS, the JobManager assigns the input splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.
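Based on the explanation above, the knob that usually matters is the source parallelism: raise it so there is at least one source task per data node. A small sketch of the submission; the jar name, paths, and node count are illustrative, not from the question:
# submit the batch job with parallelism 3 so each of the 3 nodes runs a source task
flink run -p 3 ./my-batch-job.jar --input hdfs:///data/events --output hdfs:///data/out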

Any distributed file system which supports constant-time cloning?

Lustre and the Google File System (GFS) split a file into blocks and store them on various nodes, which is how they achieve scalability and distribute traffic.
ZFS, Btrfs, and WAFL support constant-time cloning, which gives them fast cloning, writable snapshots, and storage savings.
I have been looking for a file system that supports both of these features.
There are many file systems that support constant-time cloning, but I cannot find a distributed file system that does. The Lustre team seems to be developing a ZFS-backed Lustre (which would also support cloning), but it has not been released yet (it is not even in the 2.0 beta, so it probably will not appear any time soon).
Nexenta storage seemed to support these features through its "namespace NFS", but it does not: it only distributes data file by file (file-level distribution). That means if a file exceeds the volume size of one node, it cannot be handled, and if many cloned files grow into big files, it cannot handle that either (at least, it has to really copy the original file to another node rather than sharing blocks). Maybe I could attach SAN disks to a zvol on one ZFS node, but I am very worried about traffic concentrating on that ZFS node.
So I am looking for a file system, or a solution, that can handle both of the above requirements.
One working solution is to combine the Lustre filesystem with the Robinhood Policy Engine in backup mode to continuously back up your filesystem's files. This mode makes it possible to back up a Lustre v2.x filesystem to external storage. It tracks modifications in the filesystem thanks to the Lustre 2+ changelogs feature (FS events) and copies modified files to the backend storage, according to admin-defined migration policies. You can configure your own upcall commands in Robinhood, for example to provide a scalable way to clone your filesystem and schedule sync tasks on several nodes.
With Lustre on ZFS, it should be possible to use the ZFS snapshot feature, but the ZFS stack is not yet ready for production (it is currently being tested on Sequoia at LLNL, the current number-one supercomputer).
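For reference, the constant-time clone that a ZFS-backed setup would rely on looks like this at the ZFS level; the pool and dataset names here are made up:
# take an O(1) snapshot of the dataset, then create a writable clone that shares blocks with it
zfs snapshot tank/data@baseline
zfs clone tank/data@baseline tank/data-clone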

MapReduce or a batch job?

I have a function which needs to be called on a lot of files (1000's). Each is independent of another, and can be run in parallel. The output of the function for each of the files does not need to be combined (currently) with the other ones. I have a lot of servers I can scale this on but I'm not sure what to do:
1) Run a MapReduce on it
2) Create 1000's of jobs (each has a different file it works on).
Would one solution be preferable to another?
Thanks!
MapReduce provides significant value for distributing workloads over large datasets. In your case, with many small independent jobs on small independent data files, in my opinion it could be overkill.
So I would prefer to run a bunch of dynamically created batch jobs.
Alternatively, use a cluster manager and job scheduler like SLURM: https://computing.llnl.gov/linux/slurm/
SLURM: A Highly Scalable Resource Manager
SLURM is an open-source resource manager designed for Linux clusters
of all sizes. It provides three key functions. First it allocates
exclusive and/or non-exclusive access to resources (compute nodes) to
users for some duration of time so they can perform work. Second, it
provides a framework for starting, executing, and monitoring work
(typically a parallel job) on a set of allocated nodes. Finally, it
arbitrates contention for resources by managing a queue of pending
work.
Since it is only 1000's of files (and not 1000000000's of files), a full-blown Hadoop setup is probably overkill. GNU Parallel tries to fill the gap between sequential scripts and Hadoop:
ls files | parallel -S server1,server2 your_processing {} '>' out{}
You will probably want to learn about --sshloginfile. Depending on where the files are stored you may want to learn --trc, too.
Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
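A rough sketch of the --sshloginfile and --trc options mentioned above; the hostnames, core counts, and the your_processing command are placeholders:
# nodes.txt lists "cores/host" entries that GNU Parallel may ssh into
printf '8/server1\n8/server2\n' > nodes.txt
# --trc out{}: transfer each input file, return out{}, and clean up on the remote side
ls files | parallel --sshloginfile nodes.txt --trc out{} your_processing {} '>' out{}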
Use a job array in Slurm. There is no need to submit 1000s of jobs, just one: the array job.
This will kick off the same program on as many nodes/cores as are available with the resources you specify, and eventually it will churn through them all. Your only issue is how to map the array index to a file to process. The simplest way is to prepare a text file with a list of all the paths, one per line; each element of the job array then reads the i-th line of this file and uses that as the path of the file to process, as in the sketch below.
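A minimal sketch of that idea, assuming a hypothetical file_list.txt (one path per line) and a hypothetical process_one.sh wrapper around your function:
#!/bin/bash
#SBATCH --job-name=process-files
#SBATCH --array=1-1000
# SLURM_ARRAY_TASK_ID is this task's index in the array; use it to pick one path
FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" file_list.txt)
./process_one.sh "$FILE"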

Using a temporary database as an intermediate store in a pipeline?

I have a bioinformatics analysis program that is composed of 5 different steps. Each step is essentially a Perl script that takes input, does magic, and outputs several text files. Each step needs to finish completely before the next starts. The entire process takes 24 hours or so on Core i7 machines.
One major problem is that each step produces about 5-10 gigabytes of intermediate output text files needed by subsequent steps, and there is a lot of redundancy. For example, the output of step 1 is used by steps 2, 3 and 4, and each one does the same preprocessing on it. This structure grew 'organically' because each step was developed independently. Doing everything in memory unfortunately will not work for us, since data that is 10 GB on disk is far too big to fit into memory once loaded into a Perl hash/array.
It would be nice if the data could be loaded into an intermediate database, processed once per step, and be available to all subsequent steps. The data is essentially relational/tabular. Some of the steps only need to access the data sequentially, while others need random access.
Does anyone have experience with this sort of thing?
Which database would be right for such a task? I have used and liked SQLite, but does it scale to 20 GB+ sizes? Can you tell PostgreSQL or MySQL to cache data heavily in memory? (I figure that databases written in C/C++ would be much more memory-efficient than Perl hashes/arrays, so most of the data could be cached in memory on a 24 GB machine.) Or is there a better, non-RDBMS solution, given the overhead of creating, indexing, and then destroying 20 GB+ in an RDBMS for a single-run analysis?
Have you looked at some of the NoSQL databases? They seem suited to your kind of work. I have used MongoDB for a high throughput application.
Here is a comparison of various NoSQL databases.
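Since the question specifically asks about SQLite, here is a minimal sketch of what loading one step's tabular output into it could look like; the table layout, file names, and cache size are assumptions:
sqlite3 pipeline.db <<'SQL'
-- give SQLite a large page cache (negative value = size in KiB, here ~4 GiB)
PRAGMA cache_size = -4000000;
CREATE TABLE IF NOT EXISTS step1 (id TEXT, chrom TEXT, pos INTEGER, score REAL);
.mode csv
.import step1_output.csv step1
-- index the columns the later steps use for random access
CREATE INDEX IF NOT EXISTS idx_step1_chrom_pos ON step1 (chrom, pos);
SQL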
