I am trying to build a training set for SageMaker using the Linear Learner algorithm. This algorithm supports recordIO-wrapped protobuf and CSV as training data formats. Since the training data is generated with Spark, and I am having issues generating a CSV file from a DataFrame (this seems broken for now), I am trying to use protobuf instead.
I managed to create a binary file for the training dataset using Protostuff, a library that generates protobuf messages from POJO objects. The problem is that when I trigger the training job I receive this message from SageMaker:
ClientError: No training data processed. Either the training channel is empty or the mini-batch size is too high. Verify that training data contains non-empty files and the mini-batch size is less than the number of records per training host.
The training file is certainly not empty. I suspect the way I generate the training data is incorrect, since I am able to train models using the libsvm format. Is there a way to generate recordIO using the SageMaker Java client?
Answering my own question: it was an issue in the algorithm configuration. I reduced the mini-batch size and it worked fine.
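For anyone hitting the same error, here is a minimal sketch (using the SageMaker Python SDK rather than the Java client) of writing recordIO-protobuf training data and lowering mini_batch_size; the bucket, role ARN, and data below are placeholders:

```
import io
import boto3
import numpy as np
import sagemaker
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Placeholder training data: 500 rows, 10 features, binary labels
features = np.random.rand(500, 10).astype("float32")
labels = np.random.randint(0, 2, 500).astype("float32")

# Serialize to recordIO-wrapped protobuf in memory
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

# Hypothetical bucket, key and role
bucket, key = "my-bucket", "linear-learner/train/data.rec"
role = "arn:aws:iam::123456789012:role/SageMakerRole"
boto3.client("s3").upload_fileobj(buf, bucket, key)

session = sagemaker.Session()
image = sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    predictor_type="binary_classifier",
    mini_batch_size=100,  # must stay below the number of records per training host
)

train_input = sagemaker.inputs.TrainingInput(
    f"s3://{bucket}/{key}", content_type="application/x-recordio-protobuf"
)
estimator.fit({"train": train_input})
```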
Since I haven't gotten anywhere by reading documentation and blog posts, I'll ask here:
I want to deploy a SageMaker endpoint by fitting a SageMaker Pipeline, so that the endpoint is backed by a PipelineModel. This PipelineModel should consist of two models: a fitted model which encodes my data and a model which predicts with an XGBoost estimator. I have been following along with the documentation.
But that example doesn't show how to integrate the fitted preprocessor model as a pipeline step. Which step do I have to use? A TrainingStep? Thanks in advance, I am desperate.
Check out this official example: Train register and deploy a pipeline model.
There are two variations to keep in mind:
For models that need training (usually those based on TensorFlow/PyTorch), a TrainingStep must be used so that the output (the model artifact) is generated correctly and automatically, and can later be used for inference.
For models produced by simply fitting on the data (e.g., a scaler with sklearn), you could create a TrainingStep in disguise (it adds an extra component to the pipeline and is not strictly correct, but it is a workaround). The cleaner method is to configure the preprocessing script so that it saves a model.tar.gz containing the necessary files (e.g., pickle or joblib objects); that archive can then be used in later steps as model_data. In fact, once you have a model.tar.gz, you can define a Model of various types (e.g., an SKLearnModel) that is already fitted.
At this point, you define your PipelineModel with the trained/fitted models and can either deploy directly to an endpoint or go through the model registry for a more robust approach.
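As a rough sketch of that final assembly (this is not taken from the linked example; the S3 URIs, role ARN, entry-point scripts, and framework versions are placeholders):

```
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.xgboost.model import XGBoostModel
from sagemaker.pipeline import PipelineModel

# Hypothetical S3 locations produced earlier in the pipeline
preprocessor_data = "s3://my-bucket/preprocess/model.tar.gz"   # saved by the preprocessing script
xgb_data = "s3://my-bucket/training/model.tar.gz"              # output of the XGBoost TrainingStep

role = "arn:aws:iam::123456789012:role/SageMakerRole"          # hypothetical role

sklearn_model = SKLearnModel(
    model_data=preprocessor_data,
    role=role,
    entry_point="preprocess_inference.py",   # hypothetical inference script
    framework_version="1.0-1",
)

xgb_model = XGBoostModel(
    model_data=xgb_data,
    role=role,
    entry_point="xgb_inference.py",          # hypothetical inference script
    framework_version="1.5-1",
)

pipeline_model = PipelineModel(
    name="preprocess-then-xgboost",
    role=role,
    models=[sklearn_model, xgb_model],       # inference requests pass through in this order
)

# Either deploy directly ...
# pipeline_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
# ... or register it via the model registry as in the linked example.
```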
I have been using AWS Sagemaker for years.
I don't understand why processing jobs can have several outputs. In what kind of scenario would you use more than one output destination?
By several outputs, I mean a ProcessingOutputConfig containing an array of outputs.
A common reason to use SageMaker Processing is to do a train/test/validation split for machine learning.
So you'd want to output your training data to one S3 path, your test data to another, and your validation data to yet another.
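For illustration, a minimal sketch with the SageMaker Python SDK, where each ProcessingOutput becomes one entry in the ProcessingOutputConfig array; the script, role ARN, and S3 paths are placeholders:

```
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role

processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # hypothetical script that writes to the three local output dirs below
    inputs=[
        ProcessingInput(source="s3://my-bucket/raw/", destination="/opt/ml/processing/input"),
    ],
    outputs=[
        # Each ProcessingOutput becomes one entry in ProcessingOutputConfig.Outputs
        ProcessingOutput(source="/opt/ml/processing/train", destination="s3://my-bucket/train/"),
        ProcessingOutput(source="/opt/ml/processing/test", destination="s3://my-bucket/test/"),
        ProcessingOutput(source="/opt/ml/processing/validation", destination="s3://my-bucket/validation/"),
    ],
)
```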
I have modified this sample to read PDFs in tabular format. I would like to keep the tabular structure of the original PDF during the human review process. I notice the custom worker task template uses the crowd-entity-annotation element, which seems to read only text. I am aware that the human review process reads from an S3 key containing the raw text written by the Textract process.
I have been considering writing to S3 using tabulate but I don't think that is the best solution. I would like to keep the structure and still have the ability to annotate custom entities.
Comprehend now natively supports detecting custom-defined entities in PDF documents. To do so, you can try the following steps:
Follow this GitHub README to start the annotation process for PDF documents.
Once the annotations are produced, you can use the Comprehend CreateEntityRecognizer API to train a custom entity model for semi-structured documents.
Once the entity recognizer is trained, you can use the StartEntitiesDetectionJob API to run inference on PDF documents.
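A rough boto3 sketch of the last two steps; the bucket paths, role ARN, entity type, and manifest attribute names are placeholders, so check them against the current Comprehend API documentation:

```
import boto3

comprehend = boto3.client("comprehend")
role = "arn:aws:iam::123456789012:role/ComprehendDataAccessRole"  # hypothetical role

# Train a custom entity recognizer from the PDF annotation output
recognizer = comprehend.create_entity_recognizer(
    RecognizerName="my-pdf-entity-recognizer",
    LanguageCode="en",
    DataAccessRoleArn=role,
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": "INVOICE_NUMBER"}],  # hypothetical entity type
        "AugmentedManifests": [
            {
                "S3Uri": "s3://my-bucket/annotations/output.manifest",
                "AttributeNames": ["my-labeling-job"],  # hypothetical labeling job attribute
                "AnnotationDataS3Uri": "s3://my-bucket/annotations/",
                "SourceDocumentsS3Uri": "s3://my-bucket/pdfs/",
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
            }
        ],
    },
)

# After the recognizer reaches TRAINED status, run inference on new PDFs
comprehend.start_entities_detection_job(
    JobName="pdf-entity-detection",
    LanguageCode="en",
    DataAccessRoleArn=role,
    EntityRecognizerArn=recognizer["EntityRecognizerArn"],
    InputDataConfig={
        "S3Uri": "s3://my-bucket/new-pdfs/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://my-bucket/detection-output/"},
)
```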
Good morning all,
I am currently working on a machine learning project whose goal is supervised classification on a set of data. My data is a large number of PDF files, each with a specific class, and the goal is to use these files as a training dataset in order to predict the class of new files.
My problem is that I don't know how to build my training dataset: the classification algorithm must train on the content of each file, but my training DataFrame only contains the class of each file and the name of the file in question. How do I include the content of each PDF file in my training DataFrame?
Thank you in advance for your help
PDF files usually contain text, images, charts and so on, so they cannot easily be transformed into the vectors of numbers that a machine learning algorithm expects. First you need to extract the information of interest from your files.
In this regard, you might want to first try some libraries that extract text and see what happens. For Python, a good starting point is PyPDF2. You can find a tutorial here.
If this does not work as expected, my advice would be to try some OCR tools, which read the PDF as an image to extract its contents. In Python, pytesseract is one of the most used, but it is not the only one.
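For instance, here is a minimal sketch with PyPDF2 (the 3.x API; older versions use PdfFileReader) that adds a text column to the training DataFrame; labels.csv, the pdfs/ folder, and the column names are assumptions about your setup:

```
import os
import pandas as pd
from PyPDF2 import PdfReader

def pdf_to_text(path):
    """Concatenate the text of every page in a PDF file."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# labels.csv is assumed to map file names to classes, with columns: filename, label
labels = pd.read_csv("labels.csv")
labels["text"] = labels["filename"].apply(lambda f: pdf_to_text(os.path.join("pdfs", f)))

# The "text" column can now be vectorized (e.g. with TF-IDF) and used as training features
print(labels.head())
```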
I am currently working on completing my Master's thesis project. In order to do so, I need to be able to obtain the raw data accumulated in Nagios XI and/or Opsview. Because both of these are based on Nagios Core, I assume the method of obtaining the raw data may be similar. I need this raw data so that I can later perform specific statistical calculations related to my thesis. I have looked online and so far found some Nagios plugins which obtain raw data and then manipulate it for graphs and visuals, but I need the raw numbers in order to complete my calculations.
I am also researching whether I can write a script, perhaps in PHP or some other language, that will extract the data from Nagios and save it in a Word or Excel document. However, this would be a bit of extra work, as I am unfamiliar with both PHP and MySQL queries. Because of this I hope to find a plugin, or something similar, that can get the data for me.
Cyanide,
I can't speak for NagiosXI, but I can for Opsview :)
You could access the data that is stored in the RRD files. You can use rrdtool dump to pull the values out or use a URL like: /rrdfetch?start=1307608993&end=1307695393&hsm=opsview%3A%3ACheck%20Loadavg%3A%3Aload1&hsm=opsview%3A%3ACheck%20Loadavg%3A%3Aload5
This returns the JSON data points. It is undocumented, but it is what powers the interactive JavaScript graphing.
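For example, a small Python sketch that calls that endpoint; the server name, the authenticated session, and the shape of the returned JSON are assumptions:

```
import requests

# Assumed: an already-authenticated session (the Opsview UI normally needs a login cookie or token)
session = requests.Session()

params = {
    "start": 1307608993,
    "end": 1307695393,
    "hsm": [
        "opsview::Check Loadavg::load1",
        "opsview::Check Loadavg::load5",
    ],
}
resp = session.get("https://my-opsview-server/rrdfetch", params=params)
resp.raise_for_status()

# Print whatever data points come back; the exact JSON structure is undocumented
print(resp.json())
```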
Alternatively, if you have ODW enabled with full statistics, the raw data is stored in the ODW database and you can extract it with SQL queries. See http://docs.opsview.com/doku.php?id=opsview-community:odw for more information.
Ton
You can try to use MK Livestatus: http://mathias-kettner.de/checkmk_livestatus.html
or http://exchange.nagios.org/directory/Addons/APIs/JSON/Nagios2JSON/details
All these tools give you status data without needing to go to the database or the status file. Since XI is based on Nagios, they should still work with it.
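For example, a minimal Python sketch that queries the Livestatus socket directly with LQL; the socket path is an assumption and varies by installation:

```
import socket

# Path to the Livestatus unix socket; adjust for your installation (assumed path)
SOCKET_PATH = "/var/lib/nagios/rw/live"

# Livestatus Query Language (LQL): host names and current states, returned as JSON
query = "GET hosts\nColumns: name state\nOutputFormat: json\n\n"

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect(SOCKET_PATH)
    s.sendall(query.encode())
    s.shutdown(socket.SHUT_WR)  # tell Livestatus the query is complete

    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)

print(b"".join(chunks).decode())
```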
Please take a look at http://dmytro.github.com/nagira
It's a web services API for accessing Nagios data. You can get all hosts, service status data, and object configuration in multiple formats: JSON, XML, or YAML.