Distribute data for hyperparameter optimization in pytorch - database

Question
How do I set up hyperparameter optimization on the same dataset on multiple servers (or containers) without duplicated data preprocessing?
The possible solution feels useful and like a common task, but I fear reimplementing existing code. Therefore, what is the framework I am searching for, or if there are multiple: What is the keyword I am missing?
Setting the scene
I built a package using neural networks for classification and want hyperparameter optimization.
I used pytorch-lightning and PyTorch Geometric combined with hydra
I have multiple servers on-premise with multiple GPUs each
Data preprocessing takes a fair amount of time
Data fits into each single GPU
Hyperparameters are network architecture but also variants of data preprocessing
Possible solution?
Usually, torch datasets:
download the raw dataset
preprocess it
are stored or loaded in memory to be fed into the neural network.
Now that feels inefficient when I want to optimize hyperparameters because this would have to be done at least on every server, if not even inside every container instance. Therefore I feel the following would be optimal:
1. Preprocess the raw dataset on one server instance with set i of preproc-hparams
2. Distribute the preprocessed data to all other servers
3. Test all hyperparameters
4. Go to 1. with the next set of preproc-hparams
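The loop above can be sketched as a cache keyed by the preprocessing hyperparameters. This is only a minimal sketch, not an existing framework feature: `preproc_key`, `load_or_preprocess` and `scale` are hypothetical names, the temporary directory stands in for a directory that all servers can reach (for example a shared mount), and pickle is just a placeholder serialization format.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def preproc_key(preproc_hparams: dict) -> str:
    """Stable hash of the preprocessing hyperparameters (key order is irrelevant)."""
    canonical = json.dumps(preproc_hparams, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def load_or_preprocess(raw, preproc_hparams: dict, cache_dir: Path, preprocess):
    """Return preprocessed data, computing it only if no cached copy exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{preproc_key(preproc_hparams)}.pkl"
    if target.exists():
        with target.open("rb") as fh:
            return pickle.load(fh)
    data = preprocess(raw, **preproc_hparams)
    # Write to a temporary file and rename it afterwards, so a reader
    # does not pick up a half-written cache file.
    tmp = target.with_suffix(".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(data, fh)
    tmp.replace(target)
    return data


# Hypothetical preprocessing step, used only for illustration.
calls = []


def scale(raw, factor):
    calls.append(factor)
    return [x * factor for x in raw]


cache = Path(tempfile.mkdtemp())  # stand-in for a shared directory
first = load_or_preprocess([1, 2, 3], {"factor": 2}, cache, scale)
second = load_or_preprocess([1, 2, 3], {"factor": 2}, cache, scale)
print(first, second, calls)
```

The second call reads the cached file instead of preprocessing again. The sketch does not coordinate several workers that start the same preprocessing at the same time; that would need a lock or a single designated preprocessing job.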

Related

NVIDIA Triton vs TorchServe for SageMaker Inference

NVIDIA Triton vs TorchServe for SageMaker inference? When to recommend each?
Both are modern, production grade inference servers. TorchServe is the DLC default inference server for PyTorch models. Triton is also supported for PyTorch inference on SageMaker.
Does anyone have a good comparison matrix for both?
Important notes to add here where both serving stacks differ:
TorchServe does not provide the Instance Groups feature that Triton does (that is, stacking many copies of the same model or even different models onto the same GPU). This is a major advantage for both realtime and batch use-cases, as the performance increase is almost proportional to the model replication count (i.e. 2 copies of the model get you almost twice the throughput and half the latency; check out a BERT benchmark of this here). Hard to match a feature that is almost like having 2+ GPUs for the price of one.
If you are deploying PyTorch DL models, odds are you want to accelerate them with GPUs. TensorRT (TRT) is a compiler developed by NVIDIA that optimizes your model graph and can quantize it, which represents another huge speedup, depending on GPU architecture and model. It is probably the best way of automatically optimizing your model to run efficiently on GPUs and make good use of Tensor Cores. Triton has native integration to run TensorRT engines, as they're called (even automatically converting your model to a TRT engine via the config file), while TorchServe does not (even though you can use TRT engines with it).
There is more parity between both when it comes to other important serving features: both have dynamic batching support, you can define inference DAGs with both (not sure if the latter works with TorchServe on SageMaker without a big hassle), and both support custom code/handlers instead of just being able to serve a model's forward function.
Finally, MME on GPU (coming shortly) will be based on Triton, which is a valid argument for customers to get familiar with it so that they can quickly leverage this new feature for cost-optimization.
Bottom line: I think that Triton is just as easy (if not easier) to use, a lot more optimized/integrated for taking full advantage of the underlying hardware (and will be updated to keep being that way as newer GPU architectures are released, enabling an easy move to them), and in general blows TorchServe out of the water performance-wise when its optimization features are used in combination.
Because I don't have enough reputation to reply in comments, I am writing this as an answer.
MME stands for multi-model endpoints. MME enables sharing GPU instances behind an endpoint across multiple models and dynamically loads and unloads models based on the incoming traffic.
You can read it further in this link

Manage multiple clusters in Hadoop OR Distributed Computing Framework

I have five computers networked together. Among them one is master computer and another four are slave computers.
Each slave computer has its own set of data (a very big integer matrix). I want to run four different clustering programs in four different slaves. Then, take the results back into the master computer for further processing (such as visualization).
I initially thought of using Hadoop. But I cannot find any nice way to map the above problem (specifically the output results) onto the MapReduce framework.
Is there a nice open-source distributed computing framework with which I can perform the above task easily?
Thanks in advance.
You should use YARN to manage multiple clusters or resources.
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
Reference
It seems that you have already stored the data on each of the nodes, so you have already solved the "distributed storage" element of the problem.
Since each node's dataset is different, this isn't a parallel processing problem either.
It seems to me that you don't need Hadoop or any other big data framework. However, you can embrace the philosophy of Hadoop by taking the code to the data. You run the clustering algorithm on each node, and then handle the results in whatever way you need. A caveat would be if you also have a problem in loading the data and running the clustering algorithm on each node, but that is a different problem.
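A minimal sketch of "taking the code to the data": each node runs its own clustering on its local matrix and only the small result travels back to the master. Everything here is an assumption for illustration. The thread pool stands in for the four remote machines (in practice each call would be a remote job, e.g. started over SSH), `node_data` replaces the data already stored on each node, and `cluster_locally` is a placeholder that assigns rows to fixed centers rather than a real clustering algorithm.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the integer matrix that already lives on each node.
node_data = {
    "worker1": [[1, 2], [10, 12]],
    "worker2": [[0, 1], [1, 1], [9, 9]],
    "worker3": [[20, 5]],
    "worker4": [[2, 2], [3, 1]],
}


def cluster_locally(matrix, centers):
    """Placeholder clustering: assign each row to the center nearest to its row sum."""
    labels = []
    for row in matrix:
        total = sum(row)
        nearest = min(range(len(centers)), key=lambda i: abs(total - centers[i]))
        labels.append(nearest)
    return Counter(labels)


def run_on_node(name):
    # On a real cluster this function would execute on the node itself,
    # so the matrix never leaves the machine that stores it.
    return name, cluster_locally(node_data[name], centers=[0, 20])


# The "master" starts one job per node and gathers only the summaries.
with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
    results = dict(pool.map(run_on_node, node_data))

for name, summary in sorted(results.items()):
    print(name, dict(summary))
```

The master then works with `results` (cluster sizes here, but it could be centroids or labels) for visualization or further processing.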

AWS Sagemaker custom user algorithms: how to take advantage of extra instances

This is a fundamental AWS Sagemaker question. When I run training with one of Sagemaker's built in algorithms I am able to take advantage of the massive speedup from distributing the job to many instances by increasing the instance_count argument of the training algorithm. However, when I package my own custom algorithm then increasing the instance count seems to just duplicate the training on every instance, leading to no speedup.
I suspect that when I package my own algorithm there is something special I need to do inside my custom train() function to control how it handles the training differently on a particular instance (otherwise, how would it know how the job should be distributed?), but I have not been able to find any discussion of how to do this online.
Does anyone know how to handle this? Thank you very much in advance.
Specific examples:
=> It works well in a standard algorithm: I verified that increasing train_instance_count in the first documented sagemaker example speeds things up here: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
=> It does not work in my custom algorithm. I tried taking the standard sklearn build-your-own-model example and adding a few extra sklearn variants inside of the training and then printing out results to compare. When I increase the train_instance_count that is passed to the Estimator object, it runs the same training on every instance, so the output gets duplicated across each instance (the printouts of the results are duplicated) and there is no speedup.
This is the sklearn example base: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb . The third argument of the Estimator object partway down in this notebook is what lets you control the number of training instances.
Distributed training requires a way to sync the results of the training between the training workers. Most of the traditional libraries, such as scikit-learn, are designed to work with a single worker and can't simply be used in a distributed environment. Amazon SageMaker distributes the data across the workers, but it is up to you to make sure that the algorithm can benefit from the multiple workers. Some algorithms, such as Random Forest, can take advantage of the distribution more easily, as each worker can build a different part of the forest, but other algorithms need more help.
Spark MLlib has distributed implementations of popular algorithms such as k-means, logistic regression, or PCA, but these implementations are not good enough for some cases. Most of them were too slow and some even crashed when a lot of data was used for the training. The Amazon SageMaker team reimplemented many of these algorithms from scratch to benefit from the scale and economics of the cloud (20 hours of one instance costs the same as 1 hour of 20 instances, just 20 times faster). Many of these algorithms are now more stable and much faster, beyond linear scalability. See more details here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
For the deep learning frameworks (TensorFlow and MXNet), SageMaker uses the built-in parameter server of each framework, and it does the heavy lifting of building the cluster and configuring the instances to communicate with it.
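For a custom container, one way to make each instance do different work is to split the work list by host inside train(). The sketch below assumes the job consists of independent model variants (as in the question) and relies on the file /opt/ml/input/config/resourceconfig.json, which SageMaker writes into every training container with the keys "hosts" and "current_host". `shard_for_host`, `my_share` and the variant names are hypothetical.

```python
import json
from pathlib import Path

# Written by SageMaker into each training container.
RESOURCE_CONFIG = Path("/opt/ml/input/config/resourceconfig.json")


def shard_for_host(items, hosts, current_host):
    """Return the items this host should handle (round-robin by sorted host rank)."""
    rank = sorted(hosts).index(current_host)
    return items[rank::len(hosts)]


def my_share(items, config_path=RESOURCE_CONFIG):
    """Read the resource config and return this instance's share of the work."""
    config = json.loads(Path(config_path).read_text())
    return shard_for_host(items, config["hosts"], config["current_host"])


# Hypothetical list of model variants to train and compare.
variants = ["ridge", "lasso", "svm", "random_forest", "knn"]
hosts = ["algo-1", "algo-2"]

print(shard_for_host(variants, hosts, "algo-1"))  # ['ridge', 'svm', 'knn']
print(shard_for_host(variants, hosts, "algo-2"))  # ['lasso', 'random_forest']
```

Inside the container, train() would call `my_share(variants)` and train only those variants. Each instance writes its own results, and combining them afterwards is still up to you. This only helps when the work splits into independent pieces; a single model fit still needs an algorithm that supports distributed training.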

Where does Big Data go and how is it stored?

I'm trying to get to grips with Big Data, and mainly with how Big Data is managed.
I'm familiar with the traditional form of data management and data life cycle; e.g.:
1. Structured data collected (e.g. web form)
2. Data stored in tables in an RDBMS on a database server
3. Data cleaned and then ETL'd into a Data Warehouse
4. Data is analysed using OLAP cubes and various other BI tools/techniques
However, in the case of Big Data, I'm confused about the equivalent version of points 2 and 3, mainly because I'm unsure about whether or not every Big Data "solution" always involves the use of a NoSQL database to handle and store unstructured data, and also what the Big Data equivalent is of a Data Warehouse.
From what I've seen, in some cases NoSQL isn't always used and can be totally omitted - is this true?
To me, the Big Data life cycle goes something on the lines of this:
1. Data collected (structured/unstructured/semi)
2. Data stored in NoSQL database on a Big Data platform; e.g. HBase on MapR Hadoop distribution of servers.
3. Big Data analytic/data mining tools clean and analyse data
But I have a feeling that this isn't always the case, and point 3 may be totally wrong altogether. Can anyone shed some light on this?
When we talk about Big Data, we are in most cases talking about huge amounts of data that are, in many cases, constantly written. The data can have a lot of variety as well. Think of a typical data source for Big Data as a machine in a production line that constantly produces sensor data on temperature, humidity, etc. This is not the typical kind of data you would find in your DWH.
What would happen if you transformed all this data to fit into a relational database? If you have worked with ETL a lot, you know that extracting from the source, transforming the data to fit into a schema and then storing it takes time and is a bottleneck. Creating a schema is too slow. This solution is also mostly too costly, as you need expensive appliances to run your DWH. You would not want to fill it with sensor data.
You need fast writes on cheap hardware. With Big Data you store the data schemaless at first (often referred to as unstructured data) on a distributed file system. This file system splits the huge data into blocks (typically around 128 MB) and distributes them across the cluster nodes. Because the blocks are replicated, the data stays available even when nodes go down.
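To make the block splitting concrete, here is a small sketch of the arithmetic. The 128 MB block size is taken from the text above; the replication factor of 3 (the HDFS default) and the round-robin placement are assumptions for illustration. Real HDFS placement is rack-aware and more involved, and `plan_blocks` and `place_replicas` are hypothetical helpers.

```python
import math


def plan_blocks(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, number of stored block replicas) for one file."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return n_blocks, n_blocks * replication


def place_replicas(n_blocks, nodes, replication=3):
    """Toy round-robin placement: the replicas of a block go to distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


n_blocks, n_replicas = plan_blocks(1024)  # a 1 GB file
print(n_blocks, n_replicas)               # 8 24

nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(n_blocks, nodes)

# If node n2 goes down, every block still has a replica somewhere else.
survives = all(any(node != "n2" for node in replicas) for replicas in placement.values())
print(survives)  # True
```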
If you are coming from the traditional DWH world, you are used to technologies that work well with data that is well prepared and structured. Hadoop and co are good for looking for insights, like searching for the needle in the haystack. You gain the power to generate insights by parallelising data processing, which lets you process huge amounts of data.
Imagine you collected terabytes of data and you want to run some analysis on it (e.g. a clustering). If you had to run it on a single machine it would take hours. The key of big data systems is to parallelise execution in a shared-nothing architecture. If you want to increase performance, you can add hardware to scale out horizontally. With that you speed up your search over a huge amount of data.
Looking at a modern Big Data stack, you have data storage. This can be Hadoop with a distributed file system such as HDFS or a similar file system. On top of it you have a resource manager that manages the access to the file system. On top of that again, you have a data processing engine such as Apache Spark that orchestrates the execution on the storage layer.
On top of the core engine for data processing, you have applications and frameworks such as machine learning APIs that allow you to find patterns within your data. You can run either unsupervised learning algorithms to detect structure (such as a clustering algorithm) or supervised machine learning algorithms to give some meaning to patterns in the data and to be able to predict outcomes (e.g. linear regression or random forests).
This is my Big Data in a nutshell for people who are experienced with traditional database systems.
Big data, simply put, is an umbrella term used to describe large quantities of structured and unstructured data that are collected by large organizations. Typically, the amounts of data are too large to be processed through traditional means, so state-of-the-art solutions utilizing embedded AI, machine learning, or real-time analytics engines must be deployed to handle it. Sometimes, the phrase "big data" is also used to describe tech fields that deal with data that has a large volume or velocity.
Big data can go into all sorts of systems and be stored in numerous ways, but it's often stored without structure first, and then it's turned into structured data clusters during the extract, transform, load (ETL) stage. This is the process of copying data from multiple sources into a single source or into a different context than it was stored in the original source. Most organizations that need to store and use big data sets will have an advanced data analytics solution. These platforms give you the ability to combine data from otherwise disparate systems into a single source of truth, where you can use all of your data to make the most informed decisions possible. Advanced solutions can even provide data visualizations for at a glance understanding of the information that was pulled, without the need to worry about the underlying data architecture.

Using ElasticSearch as source of truth

I am working with a team which uses two data sources.
MSSQL as a primary data source for making transaction calls.
ES as a back-up/read-only source of truth for viewing the data.
For example, when I place an order, the order is inserted into the DB, and then a RabbitMQ listener/batch job synchronizes the data from the DB to ES.
Somehow this system fails for even just a million records. By "fails" I mean that the records are not updated in ES in a timely fashion. For example, say I create a coupon: the coupon is generated in the DB, and the customer tries to redeem it immediately, but ES doesn't have the information about the coupon yet, so the redemption fails. Of course there are options such as RabbitMQ's priority queues, but the questions I have are very basic.
I have a few questions, which I asked the team, and I still haven't got satisfactory answers:
1. What is the minimum load that should be expected when we use Elasticsearch, and doesn't it become overkill if we have just 1M records?
2. Does it really make sense to use ES as source of truth for real-time data?
3. Is ES designed for handling relational-like databases, and to handle data that gets continuously updated? AFAIK such search-optimized databases are of the write-once, read-many kind.
4. If we are doing it to handle load, then how will it be different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?
The main question I have in mind is, how we can optimize this architecture so that we can scale better?
PS:
When I asked about minimum load, what I really meant is: what is the number of records/transactions above which we can say ES will be faster than conventional relational databases? Or is there no such threshold at all?
What is the minimum load that should be expected when we use Elasticsearch, and doesn't it become overkill if we have just 1M records?
Answer: the possible load depends on the capabilities of your server
Does it really make sense to use ES as source of truth for real-time data?
From ES website: "Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected."
So yes, it can be your source of truth. That said, it is "eventually consistent", which raises the question of how soon counts as "real-time", and there is no way to answer that without testing and measuring your system.
Is ES designed for handling relational-like databases, and to handle data that gets continuously updated? AFAIK such search-optimized databases are of the write-once, read-many kind.
That's a good point. Like any eventually consistent system, it is indeed NOT optimized for a series of modifications!
If we are doing it to handle load, then how will it be different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?
It won't. Bear in mind that ES, as quoted above, was built to accommodate the requirements of search and analysis. If that's not what you intend to do with it, you should consider another tool. Use the right tool for the right job.
1)
There isn't a minimum expected load.
You can have 2 small nodes (master & data) with 2 shards per index (1 primary + 1 replica).
You can also split your data into multiple indices if it makes sense from a functional point of view (i.e. how data is searched).
2)
In my experience, the main benefits you get from ElasticSearch are:
Near linear scalability.
Lucene-based text search.
Many ways to put your data to work: RESTful query API, Kibana...
Easy administration (compared to your typical RDBMS).
If your project doesn't get these benefits, then most probably ES is not the right tool for the job.
3)
ElasticSearch doesn't like data that is updated frequently. The best use case is for read-only data.
Anyway, this doesn't explain the high latency you are getting; your problem must lie in RabbitMQ or the network.
4)
Indeed, that's what I would do: MSSQL cluster for application data and ES for analytics.
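A minimal sketch of that split, assuming transactional reads (such as redeeming a coupon) go to the primary database and only search/analytics queries go to the index. The dicts and the deque are in-memory stand-ins for MSSQL, Elasticsearch and RabbitMQ, and `CouponService` is a hypothetical name, not part of any of those products.

```python
from collections import deque


class CouponService:
    def __init__(self):
        self.primary = {}          # stand-in for MSSQL (source of truth)
        self.search_index = {}     # stand-in for Elasticsearch (analytics/search)
        self.sync_queue = deque()  # stand-in for the RabbitMQ sync pipeline

    def create(self, code, discount):
        self.primary[code] = {"discount": discount, "redeemed": False}
        self.sync_queue.append(code)

    def redeem(self, code):
        # Transactional read and write hit the primary store only,
        # so they never depend on how far the sync has progressed.
        coupon = self.primary.get(code)
        if coupon is None or coupon["redeemed"]:
            return False
        coupon["redeemed"] = True
        self.sync_queue.append(code)
        return True

    def sync_once(self):
        # Copy queued changes into the search index (runs asynchronously in practice).
        while self.sync_queue:
            code = self.sync_queue.popleft()
            self.search_index[code] = dict(self.primary[code])

    def search(self, min_discount):
        # Search may lag behind the primary store; that is acceptable for analytics.
        return sorted(
            code for code, doc in self.search_index.items()
            if doc["discount"] >= min_discount
        )


svc = CouponService()
svc.create("WELCOME10", 10)
redeemed_before_sync = svc.redeem("WELCOME10")  # works although ES is not updated yet
found_before_sync = svc.search(5)
svc.sync_once()
found_after_sync = svc.search(5)
print(redeemed_before_sync, found_before_sync, found_after_sync)
```

With this routing, a delayed sync only makes search results slightly stale; it no longer breaks the redeem flow.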
