Distributed Training Terminology: Micro-batch and Per-Replica batch size - amazon-sagemaker

I am reading through the SageMaker documentation on distributed training and am confused by the terminology:
Mini-batch, micro-batch and per-replica batch size
I understand that in data parallelism there are multiple copies of the model, and each copy receives data of size = "per-replica batch size".
Could someone ELI5 how micro-batch fits into this context?
Is this common terminology in the field, or is it specific to AWS SageMaker?

Micro-batches come into the picture when you use model parallelism for training. In this case the model is sharded into multiple segments that are loaded onto different GPUs. To improve GPU utilization, model-parallel training approaches further divide the mini-batch into micro-batches, so the segments can work on different micro-batches concurrently instead of sitting idle. If you are using the data-parallel approach, you only have a global batch size and a per-replica batch size.
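To make the three sizes concrete, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration and are not part of any SageMaker API, and I am assuming the batch divides evenly at each level.

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Data parallelism: each model copy gets an equal slice of the global batch."""
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch must divide evenly across replicas")
    return global_batch_size // num_replicas


def split_into_microbatches(mini_batch, num_microbatches):
    """Model parallelism: cut one mini-batch into equally sized micro-batches."""
    if not mini_batch or len(mini_batch) % num_microbatches != 0:
        raise ValueError("mini-batch must be non-empty and divide evenly")
    size = len(mini_batch) // num_microbatches
    return [mini_batch[i:i + size] for i in range(0, len(mini_batch), size)]
```

For example, a global batch of 256 on 4 replicas gives a per-replica batch of 64, and a mini-batch of 8 samples split into 4 micro-batches gives 4 chunks of 2 samples each.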


Partition data for AWS Athena results in a lot of small files in S3

I have a large dataset (>40G) which I want to store in S3 and then use Athena for query.
As suggested by this blog post, I could store my data in the following hierarchical directory structure to enable using MSCK REPAIR to automatically add partitions while creating a table from my dataset.
However, this requires me to split my dataset into many smaller data files and each will be stored under a nested folder depending on the partition keys.
Although partitioning could reduce the amount of data scanned by Athena and therefore speed up a query, would managing a large number of small files cause performance issues for S3? Is there a tradeoff here I need to consider?
Yes, you may see a significant drop in efficiency with small files and lots of partitions.
Here is a good explanation, with suggestions on file sizes and the number of partitions; files should be larger than 128 MB to compensate for the overhead.
Also, I ran some experiments on a very small dataset (1 GB), partitioning my data by minute, hour and day. The amount of data scanned decreases as you make the partitions smaller, but the query time increases a lot (40 times slower in some experiments).
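As a rough way to check a partitioning scheme against the 128 MB guideline before writing any data, here is a small Python sketch. It assumes one file per partition and data spread evenly over time, which real datasets rarely satisfy exactly; the function names and the granularity table are my own for illustration.

```python
TARGET_MB = 128  # minimum file size suggested above

# Partitions created per day at each granularity, ordered coarse to fine.
GRANULARITIES = {"day": 1, "hour": 24, "minute": 24 * 60}


def avg_file_mb(dataset_gb, num_days, partitions_per_day):
    """Average file size in MB, assuming one file per partition and even spread."""
    return dataset_gb * 1024 / (num_days * partitions_per_day)


def finest_granularity(dataset_gb, num_days, target_mb=TARGET_MB):
    """Finest granularity whose average file still meets target_mb, else None."""
    best = None
    for name, per_day in GRANULARITIES.items():
        if avg_file_mb(dataset_gb, num_days, per_day) >= target_mb:
            best = name
    return best
```

With 40 GB covering 30 days, daily partitions average about 1.3 GB per file, while hourly partitions already drop below 128 MB, so this estimate would stop at daily partitions.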
I will try to get into it without veering too much into the realm of opinion.
For the use cases where I have used Athena, 40 GB is actually a very small dataset by the standards of what the underlying technology (Presto) is designed to handle. According to the Presto web page, Facebook uses the underlying technology to query their 300 PB data warehouse. I routinely use it on datasets between 500 GB and 1 TB in size.
As for the underlying S3 technology, S3 was used to host Dropbox and Netflix, so I doubt most enterprises could come anywhere near taxing the storage infrastructure. The S3 performance issues you may have heard about relate to websites storing many small pieces of static content in files scattered across S3. In that case, a delay in retrieving one of these small pieces of content might affect the user experience on the larger site.

Is there some kind of persistent local storage in aws sagemaker model training?

I did some experimentation with AWS SageMaker, and the download time of large datasets from S3 is very problematic, especially when the model is still in development and you want some kind of initial feedback relatively fast.
Is there some kind of local storage or other way to speed things up?
I am referring to the batch training service, which allows you to submit a job as a Docker container.
While this service is intended for already validated jobs that typically run for a long time (which makes the download time less significant), there's still a need for quick feedback.
There's no other way to do the "integration" testing of your job with the SageMaker infrastructure (configuration files, data files, etc.).
When experimenting with different variations of the model, it's important to be able to get initial feedback relatively fast.
SageMaker has a few distinct services in it, and each is optimized for a specific use case. If you are talking about the development environment, you are probably using the notebook service. The notebook instance comes with a local EBS volume (5 GB) that you can copy some data onto, so you can run fast development iterations without copying the data from S3 every time. The way to do it is by running wget or aws s3 cp from the notebook cells or from the terminal that you can open from the directory list page.
Nevertheless, it is not recommended to copy too much data into the notebook instance, as it will cause your training and experiments to take too long. Instead, you should utilize the second part of SageMaker, which is the training service. Once you have a good sense of the model that you want to train, based on the quick iterations of the small datasets on the notebook instance, you can point your model definition to go over larger datasets in parallel across a cluster of training instances. When you are sending a training job, you can also define how much local storage will be used by each training instance, but you will most benefit from the distributed mode of the training.
When you want to optimize your training job, you have a few options for storage. First, you can define the size of the EBS volume that you want your model to train on, for each of the cluster instances. You can specify it when you launch the training job (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html):
"ResourceConfig": {
    "InstanceCount": number,
    "InstanceType": "string",
    "VolumeKmsKeyId": "string",
    "VolumeSizeInGB": number
}
Next, you need to decide what kind of models you want to train. If you are training your own models, you know how these models get their data, in terms of format, compression, source and other factors that can impact the performance of loading that data into the model input. If you prefer to use the built-in algorithms that SageMaker has, note that they are optimized to process the protobuf RecordIO format. See more information here: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
Another aspect that you can benefit from (or learn if you want to implement your own models in a more scalable and optimized way) is the TrainingInputMode (https://docs.aws.amazon.com/sagemaker/latest/dg/API_AlgorithmSpecification.html#SageMaker-Type-AlgorithmSpecification-TrainingInputMode):
Type: String
Valid Values: Pipe | File
Required: Yes
You can use File mode to read the data files from S3. However, you can also use Pipe mode, which opens up a lot of options for processing data as a stream. That doesn't only mean real-time data from streaming services such as AWS Kinesis or Kafka; you can also read your data from S3 and stream it to the models, and completely avoid the need to store the data locally on the training instances.
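Putting the two settings together, here is a minimal Python sketch that builds just the ResourceConfig and AlgorithmSpecification parts of a CreateTrainingJob request as a plain dict. The helper name is hypothetical, the training image value is a placeholder you would have to fill in, and the rest of the request (job name, role, input and output config, stopping condition) is deliberately left out, so this is not a complete request.

```python
VALID_INPUT_MODES = ("File", "Pipe")  # the two values listed above


def training_job_fragment(instance_type, instance_count, volume_gb,
                          input_mode="File"):
    """Build only the storage-related parts of a CreateTrainingJob request."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {VALID_INPUT_MODES}")
    return {
        "AlgorithmSpecification": {
            # Placeholder, not a real image URI.
            "TrainingImage": "<your-training-image-uri>",
            "TrainingInputMode": input_mode,
        },
        "ResourceConfig": {
            "InstanceCount": instance_count,
            "InstanceType": instance_type,
            "VolumeSizeInGB": volume_gb,  # EBS volume per training instance
        },
    }
```

For example, training_job_fragment("ml.m5.xlarge", 2, 50, input_mode="Pipe") describes two instances with a 50 GB volume each that stream their input instead of downloading it first. With Pipe mode the volume can usually be kept small, since the dataset is not copied to disk.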
Customize your notebook volume size, up to 16 TB, with Amazon SageMaker
Amazon SageMaker now allows you to customize the notebook storage volume when you need to store larger amounts of data.
Allocating the right storage volume for your notebook instance is important while you develop machine learning models. You can use the storage volume to locally process a large dataset or to temporarily store other data to work with.
Every notebook instance you create with Amazon SageMaker comes with a default storage volume of 5 GB. You can choose any size between 5 GB and 16384 GB, in 1 GB increments.
When you create notebook instances using the Amazon SageMaker console, you can define the storage volume:
see the steps

SQL to text archiving using .net and parallel processing

I have a table that has millions of rows of logging data. I want to move the data to text files, with each day's worth of data going into its own text file. I'm in a .NET environment. What is an efficient way to achieve this?
I want to use parallel processing because we have beefy servers with many cores. Some choices I can think of are:
Parallel data readers: each reader queries a portion of the data. How do I manage the total number of connections with this approach? Also, if I went this route, I would have to avoid disrupting normal usage for the users. The other problem I can see with this approach is managing my own threads and setting an upper limit, whereas Parallel.ForEach would be much simpler.
Producer-consumer pattern: One thread reads the data and queues it in memory. Multiple writers consume the data from memory and write it out to text files.
I'm open to PetaPoco/NPoco. Ideally I want to use Parallel.ForEach without complicating the threading code too much.
Parallel processing helps when there is a lot of computation involved. Here, however, the work is mainly I/O. A hard disk can only write to one file at a time, so multithreading will not bring the speedup you hope for. On the contrary, it could reduce speed, since the hard disk could be forced to move back and forth when writing to the different files.
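To illustrate the sequential alternative, here is a minimal sketch of a single reader that streams rows ordered by day and keeps only one output file open at a time. It is written in Python with sqlite3 as a stand-in rather than .NET, and the table name log and the columns id, day and message are hypothetical; the same shape should carry over to a data reader loop in C#.

```python
import itertools
import os
import sqlite3


def export_by_day(conn, out_dir):
    """Write one text file per day using a single ordered pass over the table."""
    # Hypothetical schema: log(id, day, message). Ordering by day lets us
    # finish one file completely before opening the next.
    cursor = conn.execute("SELECT day, message FROM log ORDER BY day, id")
    written = []
    for day, rows in itertools.groupby(cursor, key=lambda row: row[0]):
        path = os.path.join(out_dir, f"{day}.txt")
        with open(path, "w", encoding="utf-8") as out:
            for _, message in rows:
                out.write(message + "\n")
        written.append(path)
    return written
```

Because the rows arrive already grouped by day, nothing has to be buffered in memory beyond the current row, and the disk only ever appends to one file.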

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: for the streaming APIs, I'd only say that it's brilliant and over the top.
Batch APIs: processing graphs are very powerful and are optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, and optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that use input on HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big-data partitioned storage on HDFS and you usually do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in simple M/R or Spark, is always planned with the idiom of 'prefer local processing', so that data is processed by the same node that holds the data blocks, which is faster and avoids data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel.
But without the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach for local input split processing than Hadoop and Spark. Hadoop creates a map task for each input split, and that task is preferably scheduled to a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). For input splits of files in HDFS, the JobManager assigns the input splits with locality preference. So there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
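The effect of parallelism on locality can be seen in a toy model. The following Python sketch is not Flink's actual assigner; it only assumes what is described above: a fixed set of source tasks pinned to hosts, taking turns to request splits, each preferring a split stored on its own host and falling back to any remaining split otherwise.

```python
def assign_splits(task_hosts, split_hosts):
    """Toy model of pull-based split assignment with locality preference.

    task_hosts:  host name of each source task (fixed for the whole job).
    split_hosts: for each split, the set of hosts that store its block.
    Returns (local_assignments, remote_assignments).
    """
    remaining = list(range(len(split_hosts)))
    local = remote = 0
    while remaining:
        for host in task_hosts:  # tasks request splits in turn
            if not remaining:
                break
            # Prefer a split stored on this task's host, else take any split.
            pick = next((s for s in remaining if host in split_hosts[s]),
                        remaining[0])
            remaining.remove(pick)
            if host in split_hosts[pick]:
                local += 1
            else:
                remote += 1
    return local, remote
```

With six splits spread evenly over hosts a, b and c, three tasks (one per host) read everything locally, while a single task on host a reads two splits locally and the other four remotely. That matches the point that low parallelism relative to the number of HDFS nodes leads to many remote reads.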
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.

Using a temporary database as an intermediate store in a pipeline?

I have a bioinformatics analysis program that is composed of 5 different steps. Each step is essentially a perl script that takes in input, does magic, and output several text files. Each step needs to be completely finished before the next starts. The entire process takes 24 hours or so on core i7 computers.
One major problem is that each step produces about 5-10 gigabytes of intermediate output text files needed by subsequent steps, and there's a bunch of redundancy. For example, the output of step 1 is used by steps 2, 3 and 4, and each one does the same preprocessing on it. This structure grew 'organically' because each step was developed independently. Doing everything in memory unfortunately will not work for us, since data that is 10 GB on disk is way too big to fit into memory once loaded into a Perl hash/array.
It would be nice if the data could be loaded onto an intermediate database, processed once in a step, and be available in all subsequent steps. The data is essentially relational/tabular. Some of the steps only need access to data sequentially, while others need random access to files.
Does anyone have any experience in this sort of thing?
Which database would be right for such a task? I have used and liked SQLite, but does it scale to 20 GB+ sizes? Can you tell PostgreSQL or MySQL to heavily cache data in memory? (I figure that databases written in C/C++ would be much more memory-efficient than Perl hashes/arrays, so most of it could be cached in memory on a 24 GB machine.) Or is there a better, non-RDBMS solution, given the overhead of creating, indexing, and subsequently destroying 20 GB+ in an RDBMS for single-run analyses?
Have you looked at some of the NoSQL databases? They seem suited to your kind of work. I have used MongoDB for a high-throughput application.
Here is a comparison of various NoSQL databases.
