PyFlink: how to set parallelism when using SQL and Table API?

I have a processing topology using PyFlink and SQL where there is data skew: I'm splitting a stream of heterogeneous data into separate streams based on the type of data in each element, and some of these substreams have far more events than others. This is causing problems with checkpointing (checkpoints are timing out). I'd like to increase parallelism for the problematic streams, but I'm not sure how to do that while targeting just those elements. Do I need to use the DataStream API here? What does this look like, please?
I have a table defined; I derive a stream from that table, then filter it so that my substream has only the events I'm interested in:
from pyflink.table.expressions import col

events_table = table_env.from_path(MY_SOURCE_TABLE)
filtered_table = events_table.filter(
    col("event_type") == "event_of_interest"
)
table_env.create_temporary_view(MY_FILTERED_VIEW, filtered_table)

# now execute SQL on MY_FILTERED_VIEW
table_env.execute_sql(...)
The default parallelism of the overall table env is 1. Is there a way to increase the parallelism for just this stream?
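One approach, sketched below on the assumption that table_env is a StreamTableEnvironment on Flink 1.16 or later (on older versions use get_config().get_configuration().set_string(...) instead of set(...)): either raise the default parallelism for all Table/SQL operators via table.exec.resource.default-parallelism, or, for control over just the skewed substream, hand it to the DataStream API, where set_parallelism can be applied to an individual transformation. The parallelism values and the my_heavy_fn name are purely illustrative.

# Option 1: raise the default parallelism used by every Table/SQL operator
table_env.get_config().set("table.exec.resource.default-parallelism", "4")

# Option 2: route the skewed substream through the DataStream API so that
# individual transformations can carry their own parallelism
ds = table_env.to_data_stream(filtered_table)
# e.g. ds = ds.map(my_heavy_fn, output_type=...).set_parallelism(8)
table_env.create_temporary_view(MY_FILTERED_VIEW, table_env.from_data_stream(ds))

# then execute SQL against MY_FILTERED_VIEW as before
table_env.execute_sql(...)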

Related

Can I perform transformations using Snowflake Streams?

Currently I have a Snowflake table being updated from a Kafka connector in near real time. I want to then, also in near real time, take these new data entries through something such as Snowflake CDC / Snowflake Streams and append some additional fields. Some of these will track max values within a certain time period (probably a window function) and others will pull values from static tables based on a join where static_table.id = realtime_table.id.
The final goal is to perform these transformations and land the results in a new presentation-level table, so I have both a source table and a presentation-level table, with little latency between the two.
Is this possible with Snowflake Streams? Or is there a combination of tools Snowflake offers that can be used to achieve this goal? Due to a number of outside constraints it is important that this can be done within the Snowflake infrastructure.
Any help would be much appreciated :).
I have considered using a materialised view, but am concerned about cost and latency.
Streams, together with Tasks, are designed to do exactly the kind of transformations you are asking for.
This quickstart is a good place to start building your Streams and Tasks skills:
https://quickstarts.snowflake.com/guide/getting_started_with_streams_and_tasks/
In step 6 you can see a task that transforms the data as it arrives:
create or replace task REFINE_TASK
USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
SCHEDULE = '4 minute'
COMMENT = '2. ELT Process New Transactions in Landing/Staging Table into a more Normalized/Refined Table (flattens JSON payloads)'
when
SYSTEM$STREAM_HAS_DATA('CC_TRANS_STAGING_VIEW_STREAM')
as
insert into CC_TRANS_ALL (select
card_id, merchant_id, transaction_id, amount, currency, approved, type, timestamp
from CC_TRANS_STAGING_VIEW_STREAM);
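To connect this back to the original question, a rough sketch of the same pattern for your use case might look like the following (every object name here, such as SRC_STREAM, ENRICH_TASK, REALTIME_TABLE, STATIC_TABLE, PRESENTATION_TABLE and MY_WH, is a hypothetical placeholder, and the Python connector is used only for convenience; the same statements can be run in a worksheet). A stream on the landing table feeds a task that joins each batch of change records to the static table and appends the enriched rows to the presentation table.

import snowflake.connector

conn = snowflake.connector.connect(...)  # account/user/auth details
cur = conn.cursor()
# track changes on the landing table
cur.execute("create or replace stream SRC_STREAM on table REALTIME_TABLE")
# task: join each batch of change records to the static table and append the result
cur.execute("""
    create or replace task ENRICH_TASK
        warehouse = MY_WH
        schedule = '1 minute'
    when system$stream_has_data('SRC_STREAM')
    as
    insert into PRESENTATION_TABLE
    select s.id, s.amount, st.extra_field
    from SRC_STREAM s
    join STATIC_TABLE st on st.id = s.id
""")
cur.execute("alter task ENRICH_TASK resume")  # tasks are created suspended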

Snowflake Task Condition: When Table Has Data

I would like to include a condition in my Snowflake task to run only if a specified table has data in it. This would be similar to task condition:
WHEN SYSTEM$STREAM_HAS_DATA('my_schema.my_table')
Except I do not wish to use a stream. The problem with using a stream in some cases is that streams can go stale. I have tables in my ELT process that may not receive updates for weeks or months, possibly even years.
One thought I had was to use a UDF in the task condition:
WHEN PUBLIC.TABLE_HAS_DATA('my_schema.my_table')
This would be great if I could throw a SELECT CAST(COUNT(1) AS BOOLEAN) FROM "my_schema"."my_table" in there. But a SQL UDF cannot do anything with a table name that is passed as a parameter. And a JavaScript UDF seems too limiting when it comes to querying tables.
Admittedly, I am not a JavaScript programmer, nor am I too familiar with Snowflake's JavaScript UDF abilities. I can perform the desired queries in a JavaScript stored procedure just fine, but those don't seem to translate over to UDFs.
Snowflake streams should only go stale if you don't do something with the data within the stream's retention period. As long as you have a task that processes the change records in the stream when they show up, you should be fine. So if you don't see a change show up in a stream for 6 months, that's fine, as long as you process that change record within your data retention period (14 days, for example).
If your task has a STREAM_HAS_DATA condition and the stream doesn't get data for 14 days, the stream will go stale, because a stream's offset only advances when the stream is consumed in a DML statement. You can work around this by removing the condition and letting the task run more often.
SYSTEM$STREAM_HAS_DATA only applies to streams: https://docs.snowflake.com/en/sql-reference/functions/system_stream_has_data.html.
Since streams can go stale, you can check the stale_after timestamp returned by the SHOW STREAMS command (available since Snowflake 5.1.x, released in January 2021) and promptly re-create streams that are about to go stale.
A solution to retrieve stale streams is provided here: Snowflake - How can I query the stream's metadata and save to table
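A small monitoring sketch along those lines using the Python connector (MY_DB and the one-day threshold are placeholders; stale_after is the SHOW STREAMS column mentioned above):

import datetime
import snowflake.connector

conn = snowflake.connector.connect(...)  # account/user/auth details
cur = conn.cursor(snowflake.connector.DictCursor)
cur.execute("show streams in database MY_DB")
soon = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=1)
for row in cur.fetchall():
    # flag streams that will go stale within a day so they can be re-created in time
    if row["stale_after"] is not None and row["stale_after"] <= soon:
        print(f"stream {row['name']} goes stale at {row['stale_after']}, re-create it")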

Unloading data from Snowflake - each row to a separate file

Please share your experiences with unloading data from Snowflake.
The table has a million rows and each row is around 16 MB of data.
The COPY INTO '#ext_stg/path/file_name' FROM schema.table statement has to generate a separate file for each row.
The intent is to generate a million files in S3.
COPY INTO is designed to write bulk data at once; using it to generate a separate file for each row is extremely slow.
Thanks!
Snowflake's COPY INTO <location> statement writes ndjson format, which already makes it very simple to divide the records with a little local processing.
It appears you've already tried a row-by-row iteration to perform such single-row exports and have found it expectedly slow. It may still be a viable option if this is only a one-time operation.
Snowflake does not offer any parallel split and per-row export technique (that I am aware of), so it may be simpler to export the entire table normally and then use a downstream parallel processing framework (such as a Spark job) to divide the input into individual record files. The ndjson format's ready-to-be-split nature makes the files easy to process in distributed frameworks.
P.S. Setting the MAX_FILE_SIZE copy option to a very low value (lower than the minimum size of your rows) will not guarantee a single file per row, as writes are done over sets of rows read together from the table.
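As an illustration of that local split: the sketch below assumes the unloaded (and decompressed) ndjson files have been downloaded to an unloaded/ directory; the directory names are placeholders, and for a million 16 MB rows you would run the same per-line logic inside a parallel framework as suggested above rather than on a single machine.

import glob
import os

os.makedirs("records", exist_ok=True)  # output directory for the per-row files
n = 0
for path in glob.glob("unloaded/*.json"):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # ndjson: one JSON document per line
            if line.strip():
                with open(f"records/record_{n}.json", "w", encoding="utf-8") as out:
                    out.write(line)
                n += 1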
You can achieve this through scripting in Python or even with a Snowflake JavaScript stored procedure.
With the Python connector, the logic would look roughly like this (the stage name ext_stg is a placeholder, and the key value would need quoting if it is a string):
import snowflake.connector
conn = snowflake.connector.connect(...)  # fill in account/user/auth details
cur = conn.cursor()
pk_vals = [row[0] for row in cur.execute("select primary_key from schema.table")]  # primary key or unique identifier
for idx, pk_val in enumerate(pk_vals):  # one COPY INTO per row
    cur.execute(
        f"copy into @ext_stg/path/file_{idx} "
        f"from (select * from schema.table where primary_key = {pk_val})"
    )

Prevent deadlock in read-committed SELECT

I am extracting data from a business system supplied by a third party to use in reporting. I am using a single SELECT statement issued from an SSIS data flow task source component that joins across multiple tables in the source system to create the dataset I want. We are using the default read-committed isolation level.
To my surprise I regularly find this extraction query is deadlocking and being selected as the victim. I didn't think a SELECT in a read-committed transaction could do this, but according to this SO answer it is possible: Can a readcommitted isolation level ever result in a deadlock (Sql Server)?
Through the use of trace flags 1204 and 1222 I've identified the conflicting statement, and the object and index in question. Essentially, the contention is over a data page in the primary key of one of the tables. I need to extract from this table using a join on its key (so I'm taking out an S lock), while the conflicting statement is performing an INSERT and requesting an IX lock on the index data page.
(Side note: the SO answer above talks about this issue occurring with non-clustered indexes, but this appears to be occurring on the clustered PK. At least, that is what I believe based on my interpretation of the deadlock information in the event log and the "associatedObjectId" property.)
Here are my constraints:
The conflicting statement is in an encrypted stored procedure supplied by a third party as part of off-the-shelf software. There is no possibility of getting the plaintext code or having it changed.
I don't want to use dirty-reads as I need my extracted data to maintain its integrity.
It's not clear to me how or if restructuring my extract query could prevent this. The lock is on the PK of the table I'm most interested in, and I can't see any alternatives to using the PK.
I don't mind my extract query being the victim as I prefer this over interrupting the operational use of the source system. However, this does cause the SSIS execution to fail, so if it must be this way I'd like a cleaner, more graceful way to handle this situation.
Can anyone suggest ways to, preferably, prevent the deadlock, or failing that, to handle the error better?
My assumption here is that you are attempting to INSERT into the same table that you are SELECTing from. If not, then a screenshot of the data flow tab would be helpful in determining the problem. If so, then you're in luck - I have had this problem before.
Add a Sort to the data flow, as this is a fully blocking transformation (see the reference below regarding blocking transformations). What this means is that the SELECT will be required to finish loading all data into the pipeline buffer before any data is allowed to pass down to the destination. Otherwise, SSIS attempts to INSERT data while there is still a lock on the table/index. You might be able to get creative with your indexing strategy here (I have not tried this), but a fully blocking transformation will do the trick and eliminates the need for any additional indexes on the table (and the overhead that entails).
Note: never use NOLOCK query hints when selecting data from a table as an attempt to get around this. I have never tried this nor do I intend to. You (the royal you) run the risk of ingesting uncommitted data into your ETL.
Reference:
https://jorgklein.com/2008/02/28/ssis-non-blocking-semi-blocking-and-fully-blocking-components/

DBInputFormat multiple records processing

When we connect to an RDBMS like MySQL using Hadoop, we usually read a record from the DB into a user-defined class that extends DBWritable and Writable. If our SQL query generates N records as output, then the act of reading a record into the user-defined class is done N times. Is there a way to get multiple records into the mapper at a time instead of one record each time?
If I understand you correctly, you think Hadoop causes N SELECT statements under the hood. That is not true. As you can see in DBInputFormat's source, it creates chunks of rows based on what Hadoop deems fit.
Obviously, each mapper will have to execute a query to fetch some data for it to process, and it might do so repeatedly, but that's still definitely nowhere near the number of rows in the table.
However, if performance degrades, you might be better off just dumping the data into HDFS / Hive and processing it from there.
