SageMaker Batch Transform entry point - amazon-sagemaker

Before the AWS SageMaker batch transform runs, I need to do some transformations. Is it possible to have a custom script and associate it as an entry point to the BatchTransformer?

SageMaker Batch Transform does its transformations using a Model. However, this model can also be a Serial Inference Pipeline model, which is basically two or more models, one running after the other:
https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html
So your first model could be one that does some transformations, and then the second model does your predictions.
It depends on what kind of transformations you're hoping to do. If they're reasonably straightforward, then use the scikit-learn image.
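For example, a rough sketch with the SageMaker Python SDK might look like the following (bucket paths, the role ARN and the predictor image URI are placeholders, and preprocess.py is a hypothetical entry point implementing the usual input_fn/predict_fn/output_fn handlers):

import sagemaker
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role ARN

# Step 1: a scikit-learn "model" whose inference script only transforms the input
preprocess_model = SKLearnModel(
    model_data="s3://my-bucket/preprocessor/model.tar.gz",  # hypothetical artifact
    role=role,
    entry_point="preprocess.py",
    framework_version="1.2-1",
    sagemaker_session=session,
)

# Step 2: the model that actually predicts
predictor_model = Model(
    image_uri="<predictor-image-uri>",                      # placeholder
    model_data="s3://my-bucket/predictor/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Chain them; batch transform runs them in order on every record
pipeline = PipelineModel(
    name="preprocess-then-predict",
    role=role,
    models=[preprocess_model, predictor_model],
    sagemaker_session=session,
)

transformer = pipeline.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/transform-output/",
)
transformer.transform(
    data="s3://my-bucket/transform-input/",
    content_type="text/csv",
    split_type="Line",
)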

The inference code and requirements.txt should be stored as part of model.tar.gz during training. They will then be used in the batch transform.
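For the framework containers, the convention (as far as I know) is that the inference script and requirements.txt live under a code/ directory inside the archive, next to the trained artifact. A minimal packaging sketch with placeholder file names:

import tarfile

# Expected layout (roughly):
#   model.tar.gz
#   ├── model.joblib          <- trained model artifact (placeholder name)
#   └── code/
#       ├── inference.py      <- input_fn / predict_fn / output_fn
#       └── requirements.txt  <- extra pip dependencies
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.joblib")
    tar.add("code/inference.py")
    tar.add("code/requirements.txt")
# Upload the archive to S3 and point the Model / Transformer at that location.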

Related

Difference between SageMaker instance count and Data parallelism

I can't understand the difference between SageMaker instance count and data parallelism, since we can already specify how many instances to train on when we write a training script using the SageMaker SDK.
However, at re:Invent 2021, the SageMaker team launched and demonstrated SageMaker managed data parallelism, and this feature also provides distributed training.
I've searched a lot of sites, but I can't find a really clear explanation. I'm sharing something that comes close to explaining the concept I mean. Link: https://godatadriven.com/blog/distributed-training-a-diy-aws-sagemaker-model/
Increasing the instance count will enable SageMaker to launch that many instances and copy the data to them. This only enables parallelization at the infrastructure level. To really carry out distributed training, we need support at the framework/code level, where the code knows how to aggregate/send gradients across all the GPUs/instances in the cluster, and in some cases how to distribute the data as well, usually when using DataLoaders. To achieve this, SageMaker has a Distributed Data Parallelism feature built into it. It is similar to other alternatives like Horovod, PyTorch DDP etc.
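As an illustration, enabling the managed data parallel library usually comes down to a distribution argument on the estimator, on top of the usual instance_count. A minimal sketch with the SageMaker Python SDK (script name, role ARN and instance choices are placeholders, and train.py is assumed to initialise the distributed backend itself):

from sagemaker.pytorch import PyTorch

# instance_count gives infrastructure-level parallelism; the distribution
# argument turns on SageMaker's data parallel library inside the training job.
estimator = PyTorch(
    entry_point="train.py",                               # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    framework_version="1.13",
    py_version="py39",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",                      # data parallel needs multi-GPU instances
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"training": "s3://my-bucket/train/"})      # placeholder S3 prefix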

What is the best way to represent data lineage in an image processing pipeline?

I am trying to determine the best way to represent data lineage for image processing. I have images stored in S3, and I want to process them and then place them back in S3. I would then want to be able to run a query so I can see all the images and processes, before and after, in a chain. For example:
Image1 -ProcessA-> Image2 -ProcessB-> Image3
I would expect a search for the "lineage" of Image2 would yield the above information.
I know this looks like a cookie-cutter case for a graph database, but I am not super familiar with them, especially for a production workflow. I have been fighting with how to implement this model in a relational database, but I feel like I am just trying to put a square peg in a round hole.
Is a graph DB the only option? Which flavor would you suggest?
Is there a way to make this work in a relational model that I have not considered?
You are correct when you say this is a cookie-cutter case for a graph database, and any of the available graph database products will likely be able to meet your requirements. You can also solve this problem using a relational database but, as you indicated, it would be like putting a square peg in a round hole.
Disclosure: I work for Objectivity, maker of the InfiniteGraph product.
I have solved similar data lineage problems using InfiniteGraph. The basic idea is to separate your data from your metadata. The "lineage" information is metadata, so let's put that in the graph database. The lineage information will include objects (nodes) that contain the metadata for images, and the workflow process steps that consume images as input and generate images or other information as output.
We might define an ImageMD type in InfiniteGraph to contain the metadata for an image, including a URI that defines where the image data is currently stored, and the size and format of the image. We might define a ProcessMD type to describe an application that operates on images. Its attributes might include the name and version of the application, as well as its deployment timestamp and the host location where it is running.
You are going to end up with an environment that looks something like the following diagram.
Then, given an image, you can track its lineage backward to see its history and forward to see how it or its derivative components evolved or were used.
This is the basis for the Objectivity, Inc. application Metadata Connect.
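InfiniteGraph itself is a commercial product, so purely to illustrate the node/edge model described above (this is not the InfiniteGraph API), here is a tiny sketch using networkx:

import networkx as nx

# Metadata-only graph: ImageMD and ProcessMD nodes, edges for consumed/produced
g = nx.DiGraph()
g.add_node("Image1", kind="ImageMD", uri="s3://my-bucket/image1.png")  # placeholder URIs
g.add_node("Image2", kind="ImageMD", uri="s3://my-bucket/image2.png")
g.add_node("Image3", kind="ImageMD", uri="s3://my-bucket/image3.png")
g.add_node("ProcessA", kind="ProcessMD", version="1.0")
g.add_node("ProcessB", kind="ProcessMD", version="2.1")

g.add_edge("Image1", "ProcessA")   # ProcessA consumed Image1
g.add_edge("ProcessA", "Image2")   # ...and produced Image2
g.add_edge("Image2", "ProcessB")
g.add_edge("ProcessB", "Image3")

# The "lineage" query for Image2: everything upstream and downstream of it
print("upstream:", nx.ancestors(g, "Image2"))      # {'Image1', 'ProcessA'}
print("downstream:", nx.descendants(g, "Image2"))  # {'ProcessB', 'Image3'}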

Apache Flink load ML model from file

I'd like to know if there is a way (or some sort of code example) to load an encoded pre-trained model (written in Python) inside a Flink streaming application, so I can build the model from the weights loaded from the file system and apply it to the data coming from the stream.
Thank you in advance.
You can do this in a number of different ways. Generally, the simplest way would be to invoke the code that downloads the model from some external storage, like S3 for example, in the open() method of your function. Then you can use the library of your choice to load the pre-trained weights and process the data. You can look for some inspiration here; this is code for loading a model serialized with protobuf and read from Kafka, but you can use it to understand the principles.
Normally I wouldn't recommend reading the model from the file system, as it's much less flexible and more troublesome to maintain. But that can be possible too, depending on your infrastructure setup. The only thing, in that case, would be to make sure that the file with the model is available on the machines that the pipeline will run on.
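For instance, a rough PyFlink sketch of the open() pattern could look like the following (the local model path and the use of joblib are assumptions; the same idea applies to the Java/Scala API):

import joblib
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction, RuntimeContext

class ModelScorer(MapFunction):
    """Loads a pre-trained model once per task in open(), then reuses it."""

    def open(self, runtime_context: RuntimeContext):
        # In practice you would download the artifact from S3/HDFS here;
        # a local path keeps the sketch self-contained.
        self.model = joblib.load("/tmp/pretrained_model.joblib")  # placeholder path

    def map(self, features):
        return float(self.model.predict([features])[0])

env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([[1.0, 2.0], [3.0, 4.0]]) \
   .map(ModelScorer()) \
   .print()
env.execute("score-stream")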

How to preprocess training data on s3 without using notebooks for built-in algorithms

I want to avoid using a SageMaker notebook and preprocess data before training, for example simply converting from CSV to protobuf format as shown in the first link below for the built-in models.
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html
The following example explains preprocessing using sklearn pipelines with the help of the SageMaker Python SDK:
https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
What are the best practices if you just need to make format-level changes and you don't need the sklearn way of processing?
It's not necessary to use SageMaker notebook instances to perform pre-processing or training. Notebooks are a way to explore and carry out experiments. For production use cases, you can orchestrate tasks in an ML pipeline, such as pre-processing, data preparation (feature engineering, format conversion etc.), model training and evaluation, using AWS Step Functions. Julien has covered it in his recent talk here.
You can also explore using AWS Glue for pre-processing, either with a Python script (via Python Shell) or Apache Spark (a Glue job). Refer to this blog for such a use case:
https://aws.amazon.com/blogs/machine-learning/ensure-consistency-in-data-processing-code-between-training-and-inference-in-amazon-sagemaker/
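For a plain format conversion, a small script like the sketch below (bucket and key names are placeholders) can run in a Glue Python Shell job or any other scheduled task rather than a notebook; it uses the RecordIO-protobuf helper shipped with the SageMaker Python SDK:

import io

import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Convert a CSV (label in the first column) into the RecordIO-protobuf
# format expected by many of the built-in algorithms.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="raw/train.csv")  # placeholder location
data = np.loadtxt(io.BytesIO(obj["Body"].read()), delimiter=",", dtype="float32")

labels, features = data[:, 0], data[:, 1:]

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

s3.upload_fileobj(buf, "my-bucket", "protobuf/train.data")    # placeholder destination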

Database recommendation

I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3D objects with this product. These are simple objects made up of nothing more than 3D coordinates, and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want anything complicated. So far, I'm leaning towards SQLite: it's small, runs within the client process and is claimed to be fast. Besides, I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion whether it is a good choice given my requirements. Also is there any equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the above-mentioned CAD objects that I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors etc.) and save them back to the library. I also expect users to add their own newly created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and its relative simplicity suggest to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism, called serialization, to write data structures to a file and read them back in again. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program, and use one of these serialization libraries to populate instances of that class or data structure.
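To illustrate the idea (shown in Python since pickle was mentioned; the class and field names are made up), a sketch might look like this:

import pickle
from dataclasses import dataclass, field

@dataclass
class Shape3D:                      # hypothetical CAD object type
    name: str
    color: str
    vertices: list[tuple[float, float, float]] = field(default_factory=list)

library = [
    Shape3D("unit_square", "grey", [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]),
    Shape3D("pyramid", "red", [(0, 0, 0), (1, 0, 0), (0.5, 0.5, 1)]),
]

# Save the whole library to one flat file...
with open("library.pkl", "wb") as f:
    pickle.dump(library, f)

# ...and load it back at startup (or after the user edits and saves an object).
with open("library.pkl", "rb") as f:
    shapes = pickle.load(f)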
edit: The requirement that the structures be mutable doesn't really affect much with regard to my answer - I still think that serialization and deserialization are the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshal SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
I suggest you consider using H2; it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these objects and their coordinates in an XML file, so basically use a distinct XML file for each object? You can place these XML files in a directory. This can be a simple structure.
I would not use a SQL database. You can easily describe every 3D object with an XML file. Pack these files into a directory and pack (zip) them all. If you need easy access to the metadata of the objects, you can generate an index file (with only names or descriptions) so that not all objects have to be parsed and loaded into memory (nice if you have something like a library manager).
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code you can use for this).
Many similar applications use XML today. It's easy to parse/write, human readable and doesn't need much space when zipped.
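To illustrate (element and attribute names are made up), writing and reading one object file could be as simple as:

import xml.etree.ElementTree as ET

# One small XML file per object, kept in a directory (or zipped together).
obj = ET.Element("object", name="unit_square", color="grey")
for x, y, z in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]:
    ET.SubElement(obj, "vertex", x=str(x), y=str(y), z=str(z))
ET.ElementTree(obj).write("unit_square.xml", encoding="utf-8", xml_declaration=True)

# Reading it back; the file stays human-readable and easy for users to edit.
root = ET.parse("unit_square.xml").getroot()
vertices = [(float(v.get("x")), float(v.get("y")), float(v.get("z")))
            for v in root.findall("vertex")]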
I have used SQLite; it's easy to use and easy to integrate with your own objects. But I would prefer a SQL database like SQLite more for applications where you need good search tools over a huge number of data records.
For the specific requirement, i.e. to provide a library of objects shipped with the application, a database system is probably not the right answer.
The first thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
The second thing is that the data you're shipping is immutable - for this purpose, therefore, you don't need the capabilities of a relational DB, just to be able to access a particular model with adequate efficiency.
For simplicity (sort of), an XML file would do nicely, as you've got good structure. Using that as a basis, you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET), etc.
Obviously SQLite stores its data in a single file per database, so if you have other reasons to need the capabilities of a DB in your storage system then yes, but I'd want to think about the utility of the DB to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible
