I have been using AWS Sagemaker for years.
I don't understand why processing jobs can have several outputs ? In what kind of scenario, can you use more than one output destination ?
When I say several outputs I refer to ProcessingOutputConfig containing an array
A common reason to use SageMaker Processing is to do a train/test/validation split for machine learning
So, you'd want to output your training data to one S3 path, your test data to another, and your validation to yet another
Related
I wanted to clear my understanding on the following.
Use case
Basically I am running a flink batch job. My requirement is following
I have 10 tables having raw data in postgresql
I want to aggregate that data by creating a tumble window of 10 minutes
I need to store the aggregated data into aggregated postgresql tables
My pseudo code somewhat looks like this
initialize StreamExecutionEnvironment, StreamTableEnvironment
load all the configs from file
configs.foreach(
load data from table
aggregate
store data
delete temporary views created
)
streamExecutionEnvironment.execute()
Everything works fine for now. Still I have gotten one question. I think with this approach all the load functions would be executed simultaneously. So it would put load on flink right as all data is getting loaded simultaneously?? Or my understanding is wrong and the data would get loaded, processed and stored one by one?? please guide
So I'm learning about Spark and I have a question about how client libs works.
My goal is to do some sort of data analysis in Spark, telling it where are the data sources (databases, cvs, etc) to process, and store results in hdfs, s3 or any kind of database like MariaDB or MongoDB.
I though about having a service (API application) that "tells" spark what I want to do. The question is: Is it enough setting the master configuration with spark:remote-host:7077 at context creation or should I send the application to spark with some sort of spark-submit command?
This completely depends on how your environment is set up, if all paths are linked to your account you should be able to run one of the two commands, to efficiently open the shell and run test commands. The reason to have a shell, is this will allow you to dynamically run commands and validate/learn how to run/tether commands onto one another and see what results come out.
Scala
spark-shell
Python
pyspark
Inside of the environment, if everything is linked to Hive tables you can check the tables by running
spark.sql("show tables").show(100,false)
The above command will run a "show tables" on the Spark-Hive-Metastore Catalogue and will return all active tables you can see (doesn't mean you can access the underlying data). The 100 means I am going to look at 100 rows and the false means to show the full string not the first N many characters.
In a mythical example if one of the tables you see is called Input_Table you can bring it into the environmrnt with the below commands
val inputDF = spark.sql("select * from Input_Table")
inputDF.count
I would heavily advise, while your learning, not to run the commands via Spark-Submit, because you will need to pass through the Class and Jar, forcing you to edit/rebuild for each testing making it difficult to figure how commands will run without have a lot of down time.
I am writing a batch job with Apache Flink using the DataSet API. I can read a text file using readTextFile() but this function just read one file at once.
I would like to be able to consume all the text files in my directory one by one and process them at the same time one by one, in the same function as a batch job with the DataSet API, if it is possible.
Other option is implement a loop doing multiple jobs, one for each file, instead of one job, with multiples files. But I think this solution is not the best.
Any suggestion?
If I got the documentation right you can read an entire path using ExecutionEnvironment.readTextFile(). You can find an example here: Word-Count-Batch-Example
References:
Flink Documentation
Flink Sources
I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before job starts. The thing is I want to write to different paths and files based on the file content. I am aware that BucketingSink(doc here) is provided to achieve this in Flink streaming. However, it seems that Dataset does not have a similar API. I have found out some Q&As on stackoverflow.(1, 2, 3). Now I think I have two options:
Use Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read files as stream and use BucketingSink.
My question is how to make a choice between them, or is there another solution ? Any help is appreciated.
EDIT: This question may be a duplicate of this .
We faced the same problem. We too are surprised that DataSet does not support addSink().
I recommend not switching to Streaming mode. You might give up some optimizations (i.e Memory pools) that are available in batch mode.
You may have to implement your own OutputFormat to do the bucketing.
Instead, you can extend the OutputFormat[YOUR_RECORD] (or RichOutputFormat[]) where you can still use the BucketAssigner[YOUR_RECORD, String] to open/write/close output streams.
That's what we did and it's working great.
I hope flink would support this soon in Batch Mode soon.
I am trying to build a training set for Sagemaker using the Linear Learner algorithm. This algorithm supports recordIO wrapped protobuf and csv as format for the training data. As the training data is generated using spark I am having issues to generate a csv file from a dataframe (this seem broken for now), so I am trying to use protobuf.
I managed to create a binary file for the training dataset using Protostuff which is a library that allows to generate protobuf messages from POJO objects. The problem is when triggering the training job I receive that message from SageMaker:
ClientError: No training data processed. Either the training channel is empty or the mini-batch size is too high. Verify that training data contains non-empty files and the mini-batch size is less than the number of records per training host.
The training file is certainly not null. I suspect the way I generate the training data to be incorrect as I am able to train models using the libsvm format. Is there a way to generate IOrecord using the Sagemaker java client ?
Answering my own question. It was an issue in the algorithm configuration. I reduced mini batch size and it worked fine.