I want to execute Flink SQL on batch data. (CSVs in S3)
However, I explicitly want Flink to execute my query in a streaming fashion, because I think it will be faster than batch mode.
For example, my query consists of filtering two tables and joining the filtered results. I want Flink not to materialize the two tables in a blocking batch fashion and then pipe the result through the join, but to use a streaming hash join operator like in the DataStream API.
How do I make this happen? I am using PyFlink.
You can read at https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/ how to set the execution mode for a Flink application. Combine this with https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/python_config/, which explains how to specify configuration options in Python applications.
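Putting those two links together, a minimal PyFlink sketch might look like the following; the table names, schemas, and S3 paths are placeholders, and the exact `set` call can differ slightly between Flink versions:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create the table environment in streaming mode so the planner uses
# pipelined operators instead of blocking batch exchanges.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# The same thing expressed as a configuration option (the key from the
# execution-mode docs, set the way the Python configuration docs describe).
# On older Flink versions this is
# t_env.get_config().get_configuration().set_string(...).
t_env.get_config().set("execution.runtime-mode", "STREAMING")

# Hypothetical CSV-backed tables in S3; the DDL is only a placeholder for
# whatever paths and schemas you actually have.
t_env.execute_sql("""
    CREATE TABLE a (id STRING, v INT) WITH (
        'connector' = 'filesystem', 'path' = 's3://my-bucket/a/', 'format' = 'csv')
""")
t_env.execute_sql("""
    CREATE TABLE b (id STRING, w INT) WITH (
        'connector' = 'filesystem', 'path' = 's3://my-bucket/b/', 'format' = 'csv')
""")

# Filter both sides and join; in streaming mode this runs as a continuous
# join rather than two fully materialized batch inputs.
t_env.execute_sql("""
    SELECT a.id, a.v, b.w
    FROM a JOIN b ON a.id = b.id
    WHERE a.v > 10 AND b.w < 100
""").print()
```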
As a part of our overall flow, data will be ingested into Azure Blob Storage from InfluxDB and a SQL DB. The thought process is to use Snowflake queries/stored procedures to load the data from blob storage into Snowflake in a scheduled manner (batch process), and to use Tasks to schedule and orchestrate the execution using Snowflake Scripting. A few questions:
Can dynamic queries be created and executed based on a config table? For example, a COPY command specifying the exact paths and files to load data from.
As a part of Snowflake Scripting, per my understanding, a sequence of steps (queries / SPs) stored in a configuration DB can be executed in order along with some control mechanism.
What are the possibilities for sending email notifications of error records by loading them into a table? Or should this be handled outside of Snowflake after the data load process, using Azure Data Factory / Logic Apps?
Is the above approach possible, and are there any limitations to doing it this way? Are there any alternate approaches that could be considered?
You can dynamically generate and execute queries with a stored procedure (SP). You can chain activities within an SP's logic, or via linked tasks running separate SPs. There is no functionality within Snowflake that will generate emails.
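The answer above refers to doing this inside a stored procedure; purely as an illustration of the dynamic-query idea, here is a rough Python sketch using snowflake-connector-python instead, where the config table `config.load_jobs`, its columns, and the connection details are all made up:

```python
import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="load_wh",
    database="etl",
)

cur = conn.cursor()
# Hypothetical config table: one row per COPY job, with the target table
# and the stage path/pattern to load from (e.g. '@my_stage/raw/').
cur.execute(
    "SELECT target_table, stage_path, file_pattern FROM config.load_jobs WHERE enabled"
)

for target_table, stage_path, file_pattern in cur.fetchall():
    # Build the COPY statement dynamically from the config row.
    copy_sql = (
        f"COPY INTO {target_table} "
        f"FROM {stage_path} "
        f"PATTERN = '{file_pattern}' "
        f"ON_ERROR = 'CONTINUE'"
    )
    conn.cursor().execute(copy_sql)

conn.close()
```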
I need to build a pipeline with different column aggregations from the same source, say, one by userId, another by productId, etc. I also want aggregations at different granularities, say hourly and daily. Each aggregation will have a different sink, say a different NoSQL table.
It seems simple to build each SQL query with the Table API, but I would like to reduce the operational overhead of managing too many Flink apps. So I am thinking of putting all the different SQL queries in one PyFlink app.
This is the first time I am building a Flink app, so I am not sure how feasible this is. In particular, I'd like to know:
Reading the Flink docs, I see there are concepts of application vs. job. Is each SQL aggregation query a separate Flink job?
Will the overall performance degrade because of too many queries in one Flink app?
Since the queries share the same source (Kinesis), will each query get a copy of the source? Basically, I want to make sure each event is processed by every SQL aggregation query.
Thanks!
You can put multiple queries into one job if you use a statement set:
https://docs.ververica.com/user_guide/sql_development/sql_scripts.html
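A minimal PyFlink sketch of a statement set for the multi-aggregation case above; the source and sink table names are placeholders and assume the corresponding CREATE TABLE DDL has already been run:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assume 'events' (Kinesis source), 'agg_by_user' and 'agg_by_product'
# (sink tables) have already been created with CREATE TABLE statements.
stmt_set = t_env.create_statement_set()

# Each add_insert_sql() call adds one aggregation to the same statement set.
stmt_set.add_insert_sql("""
    INSERT INTO agg_by_user
    SELECT user_id, TUMBLE_START(event_time, INTERVAL '1' HOUR), COUNT(*)
    FROM events
    GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)
""")

stmt_set.add_insert_sql("""
    INSERT INTO agg_by_product
    SELECT product_id, TUMBLE_START(event_time, INTERVAL '1' DAY), COUNT(*)
    FROM events
    GROUP BY product_id, TUMBLE(event_time, INTERVAL '1' DAY)
""")

# Everything in the statement set is optimized and submitted as one job,
# so the Kinesis source can be read once and fanned out to both aggregations.
stmt_set.execute()
```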
I am currently in the process of developing a pipeline using Apache Beam with Flink as an execution engine. As a part of the process I read data from Kafka and perform a bunch of transformations that involve joins, aggregations as well as lookups to an external DB.
The idea is that we want higher parallelism in Flink when we are performing the aggregations, but eventually coalesce the data and have a smaller number of processes writing to the DB so that the target DB can handle it (for example, a parallelism of 40 for the aggregations but only 10 when writing to the target DB).
Is there any way we could do that in Beam?
We are working on a requirement where we want to fetch incremental data from one Redshift cluster "row wise", process it based on our requirements, and insert it into another Redshift cluster. We want to do it row by row, not as a batch operation. For that we are writing a generic service which will do the row processing from Redshift to Redshift, so the flow is Redshift -> Service -> Redshift.
For inserting data we will use INSERT queries, committing after each batch rather than after each row for performance.
But I am a bit worried about the performance of many individual INSERT queries. Is there any other tool available that does this? There are many ETL tools available, but they all do batch processing; we want to process row wise. Can someone please advise?
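A minimal sketch of the per-batch commit pattern described above, assuming the service talks to Redshift through psycopg2; the connection details, table, columns, and batch size are placeholders:

```python
import psycopg2

BATCH_SIZE = 1000  # commit after this many rows (placeholder value)

conn = psycopg2.connect(
    host="target-cluster.example.redshift.amazonaws.com",  # placeholder
    dbname="analytics", user="etl_user", password="secret", port=5439,
)
cur = conn.cursor()

def write_rows(rows):
    """Insert processed rows one at a time, committing once per batch."""
    for i, row in enumerate(rows, start=1):
        cur.execute(
            "INSERT INTO target_table (id, payload) VALUES (%s, %s)",
            (row["id"], row["payload"]),
        )
        if i % BATCH_SIZE == 0:
            conn.commit()
    conn.commit()  # commit any trailing partial batch
```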
Based on experience, I can guarantee that your approach will not be efficient. You can refer to this link for detailed best practices:
https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
But I would suggest that you do as follows (a rough code sketch follows after the steps):
Write a Python script to unload the data from your source Redshift cluster to S3 based on a query condition that filters data as per your requirement, i.e. based on some threshold like a time or date. This operation should be fast, and you can schedule this script to execute every minute or every couple of minutes, generating multiple files.
Now you basically have a continuous stream of files in S3, where the size of each file, or batch size, can be controlled by the frequency of the previous script.
Now all you have to do is set up a service that keeps polling S3 for objects/files as they are created, processes them as needed, and puts the processed files in another bucket. Let's call this bucket B2.
Set up another Python script/ETL step that remotely executes a COPY command from bucket B2.
This is just an initial idea, though; you will have to evolve and optimize this approach. Best of luck!
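A rough Python sketch of the unload and copy steps suggested above, assuming psycopg2 for the cluster connections; the hostnames, bucket names, IAM role ARNs, table names, and the time-threshold filter are all placeholders:

```python
import psycopg2

# Step 1: unload incremental rows from the source cluster to S3.
src = psycopg2.connect(host="source-cluster.example.redshift.amazonaws.com",
                       dbname="src", user="etl_user", password="secret", port=5439)
src.cursor().execute("""
    UNLOAD ('SELECT * FROM events WHERE updated_at > DATEADD(minute, -5, GETDATE())')
    TO 's3://raw-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
    FORMAT AS CSV
""")
src.commit()

# ... a separate service polls s3://raw-bucket/, processes each file, and
# writes the results to s3://processed-bucket/ (bucket "B2" above) ...

# Step 2: copy the processed files from B2 into the target cluster.
tgt = psycopg2.connect(host="target-cluster.example.redshift.amazonaws.com",
                       dbname="tgt", user="etl_user", password="secret", port=5439)
tgt.cursor().execute("""
    COPY events_processed
    FROM 's3://processed-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    FORMAT AS CSV
""")
tgt.commit()
```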
Using gcloud-node, is it possible to query in batches (multiple queries in one network call)? I know it is possible to get and delete in batches, but can I do the same with queries somehow?
No, Cloud Datastore doesn't support query batches, so none of the client libraries can do this.
The alternative is to send the queries in parallel instead.
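The question is about gcloud-node, but for consistency with the rest of this thread here is the same parallel-query pattern sketched in Python with google-cloud-datastore and a thread pool; the kinds are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import datastore

client = datastore.Client()

def run_query(kind):
    # Each call is still its own network request; running them in parallel
    # just hides the latency of issuing them one after another.
    return list(client.query(kind=kind).fetch())

# Issue several independent queries concurrently instead of batching them.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_query, ["Task", "User", "Project"]))
```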