Execute Script after two separate ADFv2 Pipelines have completed - sql-server

I have two ADFv2 Pipelines that import data into two seperate srl tables within an Azure SQL database. Once both the pipelines have completed I would need to execute a script.
The source .csv files that initiates the execution of each individual pipeline will be created on a daily basis, but I can only execute the script when both Pipelines have completed...
each seperate pipeline is triggered via a Logic App by the creation of a seperate .csv file
I can use Logic Apps as well, but at the moment I can't find the best process to implement this.
Any help greatly appreciated.

2 situation:
1.If you don't mind the pipeline linear execution,you could use Execute Pipeline Activity. Execute function until the first two Execute Pipeline Activity executes successfully, like this process:
2.If not, my idea is using queue trigger. After pipeline execution, send a message to azure queue storage by for example Web Activity(REST API). Configure a function queue trigger, judge if it receive 2 successful messages,then do some jobs.
Of course, you could use ADF monitor SDKs to de-polling to check the execution status and results of two pipelines and do the next jobs. You could pick a suitable solution.
Besides, you could get an idea of Logic App as you mentioned in the answer.It supports run after for 2 connectors. Both of them are successful, then do the next job.

Related

Automate the execution of a C# code that uses Entity Framework to treat data?

I have code that uses Entity Framework to treat data (retrieves data from multiple tables then performs operations on it before saving in a SQL database). The code was supposed to run when a button is clicked in an MVC web application that I created. But now the client wants the data treatment to run automatically every day at a set time (like an SSIS package). How do I go about this?
But now the client wants the data treatment to run automatically every day at a set time (like an SSIS package). How do I go about this?
In addition to adding a job scheduler to your MVC application as #Pac0 suggests, here are a couple of other options:
Leave the code in the MVC project and create an API endpoint that you can invoke on some sort of schedule. Give the client a PowerShell script that calls the API and let them take it from there.
Or
Refactor the code into a .DLL or copy/paste it into a console application that can be run on a schedule using the Windows Scheduler, SQL Agent or some other external scheduler.
You could use some tool/lib that does this for you. I could recommend Hangfire, it works fine (there are some others, I have not tried them).
The example on their homepage is pretty explicit :
RecurringJob.AddOrUpdate(
() => Console.WriteLine("Recurring!"),
Cron.Daily);
The above code needs to be executed once when your application has started up, and you're good to go. Just replace the lambda by a call to your method.
Adapt the time parameter on what you wish, or even better: make it configurable, because we know customers like to change their mind.
Hangfire needs to create its own database, usually it will stay pretty small for this kind of things. You can also monitor if the jobs ran well or not, and check no the hangfire server some useful stats.

Snowflake Snowpipe - Email Alert Mechanism

I am planning to use Snowpipe to load data from Kafka, but the support team monitoring the pipe jobs needs an alert mechanism.
How can I implement an alert mechanism for Snowpipe via email/slack/etc?
The interface provided by Snowflake between the database and surroundings is mainly with cloud storage. There is no out-of-the-box integration with messaging apart from cloud storage events.
All other integration and messaging must be provided by client solutions.
Snowflake also provides scheduled tasks that can be used for monitoring purposes, but the interface limitations are the same as described above.
Snowflake is database as a service and relies on other (external) cloud services for a complete systems solution.
This is different from installing your own copy of database software on your own compute resource, where you can install any software alongside with the database.
Please correct my understanding if anything I say is incorrect. I believe Snowpipe is great for continuous data loading but it is hard or no way to track all the errors in the source file. As mentioned in the previous suggestions, we could build a visualization querying against COPY_HISTORY and/or PIPE_USAGE HISTORY but it doesn't give you ALL THE ERRORS in the source file. It only tells you these related to the errors
PIPE_USAGE HISTORY will tell you nothing about the errors in the source file.
The only function that can be helpful (for returning all errors) is the VALIDATE table function in the Information_Schema but it only validates for COPY_INTO.
There is a similar function for PIPE called VALIDATE_PIPE_LOAD but according to the documentation it returns only the first error. Snowflake says "This function returns details about ANY errors encountered during an attempted data load into Snowflake tables." But the output column ERROR says only the first error in the source file.
So here is my question. If any of you guys have successfully Snowpipe to load in real-time production environment how are you doing the error handling and alerting mechanism?
I think as compared to Snowpipe, using COPY_INTO within a Stored Procedure and have shell script calling this Stored procedure and then scheduling this script to run using any Enterprise Scheduler like Autosys/Control-m is a much streamlined solution.
Using External functions, Stream and Task for alerting is an elegant solution maybe but again I am not sure if solves the problem of error-tracking.
Both email and Slack alerts can be implemented via external functions.
EDIT (2022-04-27): Snowflake now officially supports Error Notifications for Snowpipe (currently in Public Preview, for AWS only).
"Monitoring" & "alert mechanism" are a very broad terms. What do you want to monitor? What should be triggering the alerts? The answer can only be as good as the question, so adding more details would be helpful.
As Hans mentioned in his answer, any solution would require the use of systems external to Snowflake. However, Snowflake can be the source of the alerts by leveraging external functions or notification integrations.
Here are some options:
If you want to monitor Snowpipe's usage or performance:
You could simply hook up a BI visualization tool to Snowflake's COPY_HISTORY and/or PIPE_USAGE_HISTORY. You could also use Snowflake's own visualization tool, called Snowsight.
If you want to be alerted about data loading issues:
You could create a data test against COPY_HISTORY in DBT, and schedule it to run on a regular basis in DBT Cloud.
Alternatively, you could create a task that calls a procedure on a schedule. Your procedure would check COPY_HISTORY first, then call an external function to report failures.
Some notes about COPY_HISTORY:
Please be aware of the limitations described in the documentation (in terms of the privileges required, etc.)
Because COPY_HISTORY is an INFORMATION_SCHEMA function, it can only operate on one database at a time.
To query multiple databases at once, UNION could be used to combine the results.
COPY_HISTORY can be used for alerting only, not diagnostic. Diagnosing data load errors is another topic entirely (the VALIDATE_PIPE_LOAD function is probably a good place to start).
If you want to be immediately notified of every successful data load performed by Snowpipe:
Create an external function to send notifications/alerts to your service(s) of choice.
Create a stream on the table that Snowpipe loads into.
Add a task that runs every minute, but only when the stream contains data, and have it call your external function to send out the alerts/notifications.
EDIT: This solution does not provide alerting for errors - only for successful data loads! To send alerts for errors, see the solutions above ("If you want to be alerted about data loading issues").

Should I have to sumit jobs to spark or I can run them from client lib?

So I'm learning about Spark and I have a question about how client libs works.
My goal is to do some sort of data analysis in Spark, telling it where are the data sources (databases, cvs, etc) to process, and store results in hdfs, s3 or any kind of database like MariaDB or MongoDB.
I though about having a service (API application) that "tells" spark what I want to do. The question is: Is it enough setting the master configuration with spark:remote-host:7077 at context creation or should I send the application to spark with some sort of spark-submit command?
This completely depends on how your environment is set up, if all paths are linked to your account you should be able to run one of the two commands, to efficiently open the shell and run test commands. The reason to have a shell, is this will allow you to dynamically run commands and validate/learn how to run/tether commands onto one another and see what results come out.
Scala
spark-shell
Python
pyspark
Inside of the environment, if everything is linked to Hive tables you can check the tables by running
spark.sql("show tables").show(100,false)
The above command will run a "show tables" on the Spark-Hive-Metastore Catalogue and will return all active tables you can see (doesn't mean you can access the underlying data). The 100 means I am going to look at 100 rows and the false means to show the full string not the first N many characters.
In a mythical example if one of the tables you see is called Input_Table you can bring it into the environmrnt with the below commands
val inputDF = spark.sql("select * from Input_Table")
inputDF.count
I would heavily advise, while your learning, not to run the commands via Spark-Submit, because you will need to pass through the Class and Jar, forcing you to edit/rebuild for each testing making it difficult to figure how commands will run without have a lot of down time.

how to index data in solr from database automatically

I have MySql database for my application. i implemented solr search and used dataimporthandler(DIH)to index data from database into solr. my question is: is there any way that if database gets updated then my solr indexes automatically gets update for new data added in the database. . It means i need not to run index process manually every time data base tables changes.If yes then please tell me how can i achieve this.
I don't think there is a possibility in Solr which lets you index the data when any updates happens to DB.
But there could be possibilities like, with the help of Triggers - there is a possibility to run an external application from triggers.
Write a CRON to trigger PHP script which does reading from the DB and indexing it in Solr. Write a trigger (which calls this script) for CRUD operation and dump it into DB, so, whenever something happens to DB, this trigger will call the above script and indexing could happen.
Please see:
Invoking a PHP script from a MySQL trigger
Automatic Scheduling:
Please see this post How can I Schedule data imports in Solr for more information on scheduling. The second answer, explains how to import using Cron.
Since you used a DataImportHandler to initially load your data into Solr... You could create a Delta Import Handler that is executed using curl from a cron job to periodically add changes in the database to the index. Also, if you need more real time updates, as #Rakesh suggested, you could use a trigger in your database and have that kick off the curl call to the Delta DIH.
you can import the data using your browser and task manager.
do the following steps on windows server...
GO to Administrative tools => task Schedular
Click "Create Task"
Now a screen of Create Task will be open with the TAB
General,Triggers,Actions,Conditions,Settings.
In the genral tab enter the task name "Solrdataimport" and in discriptions enter "Import mysql data"
Now go to Triggers tab CLick new in Setting check Daily.In Advanced setting Repeat task every ... Put time there whatever you want.click OK
Now go to Actions button click new Button IN setting put Program/Script "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" this is the installation path of chrome browser.In the Add Arguments enter http://localhost:8983/solr/#/collection1/dataimport//dataimport?command=full-import&clean=true And click OK
Using the all above process Data import will Run automatically.In case of Stop the Imort process follow the all above process just change the Program/Script "taskkill" in place of "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" under Actions Tab In arguments enter "f /im chrome.exe"
Set the triggers timing according the requirements
What you're looking for is a "delta-import", and a lot of the other posts have information about that covered. I created a Windows WPF application and service to issue commands to Solr on a recurring schedule, as using CRON jobs and Task Scheduler is a bit difficult to maintain if you have a lot of cores / environments.
https://github.com/systemidx/SolrScheduler
You basically just drop in a JSON file in a specified folder and it will use a REST client to issue the commands to Solr.

How can I have my polling service call SSIS sequentially?

I have a polling service that checks a directory for new files, if there is a new file I call SSIS.
There are instances where I can't have SSIS run if another instance of SSIS is already processing another file.
How can I make SSIS run sequentially during these situations?
Note: parallel SSIS's running is fine in some circumstances, while in others not, how can I achieve both?
Note: I don't want to go into WHEN/WHY it can't run in parallel at times, but just assume sometimes it can and sometimes it can't, the main idea is how can I prevent a SSIS call IF it has to run in sequence?
If you want to control the flow sequentially, think of a design like where you can enqueue requests (for invoking SSIS) to a queue data structure. At a time, only the top request from the queue will be processed. As soon as that request completes, next request can be dequeued.

Resources