How to run the same processor flow with many parameters iteratively - loops

I want to run a shell script to generate data files. I have created two parameters, date and source. The flow works fine with one set of parameters, but I want to run it for various parameter values.
Please advise.
Thanks

Related

How does batch processing over multiple loops work in Apache Flink?

I wanted to clarify my understanding of the following.
Use case
Basically I am running a Flink batch job. My requirements are as follows:
I have 10 tables containing raw data in PostgreSQL
I want to aggregate that data using a 10-minute tumbling window
I need to store the aggregated data in aggregated PostgreSQL tables
My pseudocode looks somewhat like this:
initialize StreamExecutionEnvironment, StreamTableEnvironment
load all the configs from file
configs.foreach(
    load data from table
    aggregate
    store data
    delete temporary views created
)
streamExecutionEnvironment.execute()
Everything works fine for now, but I still have one question. I think that with this approach all the load functions would be executed simultaneously, so it would put load on Flink since all the data is being loaded at once, right? Or is my understanding wrong, and would the data get loaded, processed and stored one table at a time? Please guide me.
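For what it's worth, here is a minimal PyFlink sketch of what such a loop could look like with the Table API. The config list, the table names and the event_time column are made up, and it assumes the PostgreSQL source and sink tables are already registered (e.g. via the JDBC connector); a StatementSet is used so that all the per-table INSERTs are submitted together as one job:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# hypothetical (source, sink) pairs; in the real job these would come from the config file
configs = [("raw_table_1", "agg_table_1"), ("raw_table_2", "agg_table_2")]

stmt_set = t_env.create_statement_set()
for source_table, sink_table in configs:
    # add_insert_sql only adds to the plan; nothing runs inside the loop
    stmt_set.add_insert_sql(f"""
        INSERT INTO {sink_table}
        SELECT window_start, window_end, COUNT(*) AS cnt
        FROM TABLE(TUMBLE(TABLE {source_table}, DESCRIPTOR(event_time), INTERVAL '10' MINUTES))
        GROUP BY window_start, window_end
    """)

# the whole plan (covering all tables) is submitted as a single Flink job here
stmt_set.execute().wait()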

Loading a local CSV into Snowflake

After 4 days of trying everything to load data into Snowflake, nothing seems to work at all.
Now, as my last option, I want to load a local CSV file into Snowflake in order to be able to follow the tutorial I am watching.
Unfortunately even this step seems to be a hard one in Snowflake. I have seen that I need to create an internal stage for this, so I went to the Stages page and created a "Snowflake Managed" stage, which I think should be an internal stage. I called that stage "MY_CSV_STAGE".
[Screenshot: Internal Stage option in Snowflake]
Then I went back to the worksheet and tried the following command:
PUT file://C:\Users\User\Downloads\Projekte/csv_dateien_fuer_snowflake.csv #MY_CSV_STAGE AUTO_COMPRESS=TRUE;
Now when trying to run the command I just receive a weird error which I don't understand:
[Screenshot: error message]
I would really like to understand what exactly I am doing wrong. I have also read in other places that I might need SnowSQL to import data from a local file into Snowflake, but I could not figure out how to install SnowSQL.
How can I write this command in Snowflake in order to import the CSV file?
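As far as I know, PUT cannot be executed from the classic web worksheet at all; it has to be run through SnowSQL or one of the drivers, and a named stage is referenced with a leading @ rather than #. As a rough sketch using the Python connector (all credentials and object names below are placeholders), the upload could look like this:

import snowflake.connector

# placeholder connection details -- replace with your own account, user, etc.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

# upload the local file into the internal stage (note the @ in front of the stage name)
conn.cursor().execute(
    "PUT file://C:/Users/User/Downloads/Projekte/csv_dateien_fuer_snowflake.csv "
    "@MY_CSV_STAGE AUTO_COMPRESS=TRUE"
)

Once the file is staged, a COPY INTO <your_table> statement with a matching file format loads it into the target table.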

Do I have to submit jobs to Spark or can I run them from a client lib?

So I'm learning about Spark and I have a question about how client libs work.
My goal is to do some sort of data analysis in Spark, telling it where the data sources are (databases, CSVs, etc.) to process, and to store the results in HDFS, S3 or any kind of database like MariaDB or MongoDB.
I thought about having a service (API application) that "tells" Spark what I want to do. The question is: is it enough to set the master configuration to spark://remote-host:7077 at context creation, or should I send the application to Spark with some sort of spark-submit command?
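For reference, "setting the master at context creation" looks roughly like the sketch below in PySpark. The host remote-host:7077 is taken from the question; the app name, paths and column name are placeholders, and it assumes the driver can reach the cluster and that the relevant connectors and credentials are configured:

from pyspark.sql import SparkSession

# point the driver at the standalone master mentioned in the question
spark = (
    SparkSession.builder
    .master("spark://remote-host:7077")
    .appName("data-analysis-api")   # hypothetical application name
    .getOrCreate()
)

# hypothetical source and sink, just to show the shape of a job
df = spark.read.csv("s3a://my-bucket/input.csv", header=True)   # assumes S3 jars/credentials are set up
df.groupBy("some_column").count().write.mode("overwrite").parquet("hdfs:///output/counts")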
This completely depends on how your environment is set up. If all paths are linked to your account, you should be able to run one of the two commands below to open an interactive shell and run test commands. The reason to use a shell is that it lets you run commands dynamically, chain them together, and see what results come out while you learn.
Scala
spark-shell
Python
pyspark
Inside the environment, if everything is linked to Hive tables, you can list the tables by running:
spark.sql("show tables").show(100,false)
The above command runs a "show tables" against the Spark Hive metastore catalogue and returns all active tables you can see (which doesn't mean you can access the underlying data). The 100 means up to 100 rows will be displayed, and the false means the full string is shown rather than just the first N characters.
In a mythical example, if one of the tables you see is called Input_Table, you can bring it into the environment with the commands below:
val inputDF = spark.sql("select * from Input_Table")
inputDF.count
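Since pyspark was mentioned above as well, the rough Python equivalents of the same two snippets would be along these lines (Input_Table is still just the hypothetical table name from the example):

# inside the pyspark shell the `spark` session already exists
spark.sql("show tables").show(100, False)          # list tables without truncating names

input_df = spark.sql("select * from Input_Table")  # hypothetical table from the example above
input_df.count()                                   # runs the query and returns the row count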
While you are learning, I would strongly advise against running the commands via spark-submit, because you will need to pass in the class and JAR, forcing you to edit and rebuild for each test, which makes it difficult to figure out how commands will run without a lot of downtime.

Passing SSIS parameters from Job

I'm trying to create a Job that will run my SSIS project.
In my project, I have 3 user-defined parameters: startDate, endDate, shiaruchDate.
I've been searching online for a while but couldn't find an answer that helps me.
I created a new job, then created a new step, but the Parameter tab is empty and I can't fill anything in. Where and how do I define the 3 parameters as inputs? Where do I need to specify this?
Thanks.
I assume you are using SSIS 2012 or above.
If you have user-defined parameters, you can use environment variables to configure the parameters in SSISDB.
This link has further info with screenshots, hope that helps.
http://www.sqlchick.com/entries/2013/9/15/getting-started-with-parameters-variables-configurations-in.html
Cheers
Nithin

SSIS Package - track calling job

I'm looking for ideas on how to automatically track the job that calls the package.
We have some generic packages that are called from different jobs; each job passes in different file paths as parameters and therefore processes very differently sized files depending on the path.
In the package I have some custom auditing set up which basically tracks the package start time and end time, and therefore the duration of execution. I want to also be able to track the job that called the package, so that if the package is running long I can determine which job called it.
Also note that I would prefer this to be automatic, possibly using some sort of system variable, so that human error is not an issue. I also want these auditing tasks built into all of our packages as a template, so I would prefer not to use a user variable either, as different packages may use different variables.
Just looking for some ideas - appreciate any input
We use parent and child packages instead of different jobs calling the same package. You could send the information about which parent called it to the child package, and then have the child package record that data in a table along with the start date and end date.
Our solution has a whole meta database that records all the details through logging of each step. The parent tells the child which configuration to use and logs details against that configuration. The jobs call the parent package, never the child package (which doesn't have a configuration in the config table, as it is always configured through variables sent in by the parent package). No human intervention is needed (except for initial development, or research when a failure occurs).
Edit, for existing jobs:
Consider that jobs can have multiple steps. Make the first step a SQL script that inserts the auditing information into a table, including the start time of the package, the name of the job that called it and the name of the SSIS package being called. Then have the second step call the SSIS package, and make the last step a SQL script that inserts the same data, only with the end datetime.
A simple way to do this is to set up a variable on your SSIS package as a varchar, and set the value of the variable to #[System::ParentContainerGUID] using an expression when the package starts. SQL Agent won't set the value, so when the package is run as an individual job it will be an empty string, but if it is called by another package it will contain the GUID of the calling package. You can test for that value and use a precedence constraint to control the program logic.
We have packages that run as a part of a big program but sometimes we need to run them individually. Each package has an email on failure task but we only want that to execute when the package is run individually. When it is part of the big run we collect the names of all packages that error and send them as one email from the master package. We don't want individual emails and a summary email going out on the same run.
