We have 4 CDC sources defined, and we need to combine their data into one result table. We're creating a table for each source using the SQL API, e.g.:
"CREATE TABLE IF NOT EXISTS PAA31 (\n" +
" WRK_SDL_DEF_NO STRING,\n" +
" HTR_FROM_DT BIGINT,\n" +
...
" update_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,\n" +
" PRIMARY KEY (WRK_SDL_DEF_NO) NOT ENFORCED,\n" +
" WATERMARK FOR update_time AS update_time\n" +
") WITH ('value.format' = 'debezium-json' ... )";
After we've defined each table, we create a new table by running the following query:
"SELECT PAA30.WRK_SDL_DEF_NO as id,\n" +
" PAA33.DSB_TX as description,\n" +
...
"FROM PAA30\n" +
"INNER JOIN PAA33 ON PAA30.WRK_SDL_DEF_NO = PAA33.WRK_SDL_DEF_NO AND PAA33.LGG_CD = 'NL' \n" +
"INNER JOIN PAA31 ON PAA30.WRK_SDL_DEF_NO = PAA31.WRK_SDL_DEF_NO \n" +
"INNER JOIN PAA32 ON PAA30.WRK_SDL_DEF_NO = PAA32.WRK_SDL_DEF_NO";
Note some rows have been left out for formatting reasons.
The issue we're running into is that executing this exact job produces inconsistent outcomes: sometimes we get 1750 resulting rows (correct), but most of the time the number of resulting rows is lower and seemingly random.
This is the plan overview for the job in Flink. The number of records sent from the sources is correct in every case, but the number of records sent by the first join is not:
[Image: Flink job execution plan with record counts]
What could be the cause and how can we have consistent joining of all data sources?
I see that your pipeline includes an event time window with a processing time trigger, and does watermarking with zero tolerance for out-of-order events. These could be causing problems.
Flink can only produce completely correct, deterministic results for streaming workloads that involve event time logic if there are no late events. Late events can occur whenever processing time logic interferes with the watermarking, e.g.,
if the watermark generator is misconfigured and doesn't account for the actual out-of-orderness (see the DDL sketch after this list)
if idleness detection is used, and an inactive stream becomes re-activated
after a restart (or recovery, or rescaling) occurs
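For example, instead of zero tolerance, the table definitions could declare some bounded out-of-orderness. This is only a sketch of the idea; the 5-second bound is an illustrative value, not a recommendation, and the elided columns and connector options are as in the original definition:
CREATE TABLE IF NOT EXISTS PAA31 (
    WRK_SDL_DEF_NO STRING,
    HTR_FROM_DT BIGINT,
    -- ... other columns as in the original definition ...
    update_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,
    PRIMARY KEY (WRK_SDL_DEF_NO) NOT ENFORCED,
    -- allow events to arrive up to 5 seconds out of order
    -- before the watermark moves past them
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
    'value.format' = 'debezium-json'
    -- ... other connector options as in the original definition ...
);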
Just guessing, however. Would need to see more details to give a more informed answer. A minimal, reproducible example would be ideal.
Update:
It's also the case that streaming jobs won't emit their last set of results unless something provokes them to do so. In this case you could, for example, use
./bin/flink stop $JOB_ID --drain --savepointPath /tmp/flink-savepoints
to force a large watermark to be emitted that will close the last window.
Update 2:
Regular joins don't produce results with time attributes or watermarks. This is because it's impossible to guarantee that the results will be emitted in any particular order, so meaningful watermarking isn't possible. Normally it's not possible to apply event time windowing after such a join.
Update 3:
Having now studied the latest code, I can see that this doesn't have anything to do with watermarks.
If I understand correctly, the issue is that while the results always include what should be produced, there are varying numbers of additional output records. I can suggest two possible causes:
(1) When Flink is used with Debezium server there's the possibility of duplicate events. I don't think this is the explanation, but it is something to be aware of.
(2) The result of the join is non-deterministic (it varies from run to run). This is happening because the various input streams are racing against each other, and the exact order in which related events from different streams are ingested is affecting how the results are produced.
The result of the join is a changelog stream. I suspect that when the results are perfect, no retractions occurred, while in the other cases some preliminary results are produced that are later updated.
If you examine the ROW_KIND information in the output stream you should be able to confirm if this guess is correct.
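One way to check this, assuming you can run the same query in the Flink SQL client, is to switch the client's result display to changelog mode so that every emitted row is shown with its row kind (+I, -U, +U, -D); the SET key below is a SQL client option, not a job configuration:
-- in the Flink SQL client
SET 'sql-client.execution.result-mode' = 'changelog';

-- then run the join from the question; any -U/-D rows are retractions
-- of preliminary results that were later updated
SELECT PAA30.WRK_SDL_DEF_NO AS id,
       PAA33.DSB_TX AS description
FROM PAA30
INNER JOIN PAA33 ON PAA30.WRK_SDL_DEF_NO = PAA33.WRK_SDL_DEF_NO AND PAA33.LGG_CD = 'NL'
INNER JOIN PAA31 ON PAA30.WRK_SDL_DEF_NO = PAA31.WRK_SDL_DEF_NO
INNER JOIN PAA32 ON PAA30.WRK_SDL_DEF_NO = PAA32.WRK_SDL_DEF_NO;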
I'm not very familiar with the pulsar connector, but I'm guessing you should be using the upsert_pulsar sink.
We've been able to get consistent results, even for bigger datasets, by enabling MiniBatch Aggregation:
configuration.setString("table.exec.mini-batch.enabled", "true");
configuration.setString("table.exec.mini-batch.allow-latency", "500 ms");
configuration.setString("table.exec.mini-batch.size", "5000");
This seems to fix the consistency issue for both the local filesystem connector as well as for the Flink Pulsar connector.
From these findings, it seems Flink was having issues with the overhead of state management for our throughput. We'll still need to assess realistic CDC initial-load processing, but so far enabling MiniBatch Aggregation seems promising.
Thanks @david-anderson for thinking with us and trying to figure this out.
What does it mean when COMPILATION_TIME, QUEUED_PROVISIONING_TIME, or both are longer than usual?
I have a query that runs every couple of minutes, and it usually takes less than 200 milliseconds for compilation and 0 for provisioning. In the last couple of days there have been 2 instances where the values were more than 4000 for compilation and more than 100000 for provisioning.
Does that mean the warehouse was being resumed and there was a hiccup?
COMPILATION_TIME:
The SQL is parsed and simplified, and the tables' metadata is loaded. Thus a compile for select a,b,c from table_name will be fractionally faster than select * from table_name, because the metadata is not needed from every partition to know the final shape.
Super-fragmented tables can give poor compile performance, as there is more metadata to load. Fragmentation comes from many small writes/deletes/updates.
Doing very large INSERT statements can give horrible compile performance. We did a lift-and-shift and did all data loading via INSERT; just avoid it.
PROVISIONING_TIME is the amount of time needed to set up the hardware. This occurs for two main reasons. First, you are turning on 3X, 4X, 5X, or 6X servers, and it can take minutes just to allocate that volume of servers.
Second, there is a failure: sometimes around releases there can be a little instability, where a query fails on the "new" release and is rolled back to older instances, which you would see in the profile as 1, 1001. But sometimes there have been problems in the provisioning infrastructure itself (I've not seen it for a few years, but I'm not monitoring for it at present).
On an ongoing basis, though, I would expect you to mostly see this for the first reason.
The compilation process involves query parsing, semantic checks, query rewrite components, reading object metadata, table pruning, evaluating certain heuristics such as filter push-downs, plan generation based on cost-based optimization, etc., all of which accounts for the COMPILATION_TIME.
QUEUED_PROVISIONING_TIME refers to the time (in milliseconds) spent in the warehouse queue, waiting for the warehouse compute resources to provision, due to warehouse creation, resume, or resize.
https://docs.snowflake.com/en/sql-reference/functions/query_history.html
To understand in detail why the query has recently been taking a long time, the query ID needs to be analysed. You can raise a case with Snowflake support, including the problematic query ID, to have the details checked.
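As a starting point, you can pull those columns yourself from the QUERY_HISTORY table function; this is a minimal sketch (the RESULT_LIMIT and the ordering are arbitrary choices) to spot the outlier queries:
-- compilation and provisioning times (in milliseconds) for recent queries
SELECT query_id,
       warehouse_name,
       compilation_time,
       queued_provisioning_time,
       total_elapsed_time
FROM TABLE(information_schema.query_history(result_limit => 1000))
ORDER BY queued_provisioning_time DESC;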
I've been testing a simple join with both the Table API and the DataStream API in batch mode. However, I've been getting pretty bad results, so I must be doing something wrong. The datasets used for joining are ~900 GB and 3 GB. The environment used for testing is EMR with 10 m5.xlarge worker nodes.
The Table API approach creates tables over the S3 data paths and performs an INSERT INTO statement into a table created over the destination S3 path. I tried tweaking task manager memory, numberOfTaskSlots, and parallelism, but couldn't get it to perform in an acceptable time (at least 1.5 h).
When using the DataStream API in batch mode, I always hit a problem where YARN kills the task because it uses over 90% of the disk space. So I'm not sure whether that's due to the code, or whether Flink simply needs much more disk space than Spark does.
Reading in datastreams:
val sourceStream: DataStream[SourceObj] = env.fromSource(source, WatermarkStrategy.noWatermarks(), "someSourceName")
.map(x => SourceObj.convertFromString(x))
Joining:
val joinedStream = sourceStream.join(sourceStream2)
  .where(s => s.col1)    // key selector for the first stream
  .equalTo(c => c.col2)  // key selector for the second stream
  .window(GlobalWindows.create())
  .trigger(CountTrigger.of(1))
  .apply {
    (s, c) => JoinedObj(c.col1, s.col2, s.col3)
  }
Am I missing something, or do I just need to scale up the cluster?
In general you're better off implementing relational workloads with Flink's Table/SQL API, so that its optimizer has a chance to help out.
But if I'm reading this correctly, this particular join is going to be quite expensive to execute because nothing is ever expired from state. Both tables will be fully materialized within Flink because, for this query, every row of input remains relevant and could affect the result.
If you can convert this into some sort of join with a temporal constraint that can be used by the optimizer to free up rows that are no longer useful, then it will be much better behaved.
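For example, an interval join in Flink SQL lets rows be expired once the time bound has passed. This is just a sketch with hypothetical tables (Orders, Shipments), both of which would need an event-time attribute ts with a watermark:
-- each Orders row only has to be kept in state until the 4-hour bound
-- has passed, so the join state does not grow without limit
SELECT o.id, o.ts, s.ts AS shipped_ts
FROM Orders o
JOIN Shipments s
  ON o.id = s.order_id
 AND s.ts BETWEEN o.ts AND o.ts + INTERVAL '4' HOUR;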
When you are using the DataStream API in batch mode, it makes extensive use of managed memory in all shuffle/join/reduce operations. Also, as stated in the last paragraph here, Flink will spill to disk all the data that it cannot fit in memory during the join.
So I assume this may be the reason for the disk space shortage. I faced the same problem with my job.
Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is around how best to safely perform this MERGE without missing any records? At the moment, we are performing a single data extraction job per-day so there is normally a point where the Snowpipe queue is empty which we use as an indicator that it is safe to proceed, however we are looking to move to more frequent extractions where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work but it is not clear to me that this is an advised way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link) however we're talking about pausing it for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is if the MERGE and DELETE could be contained in a transaction it may provide a safe way to process the incoming data throughout the day but I'm not sure if that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At that point the MERGE and TRUNCATE would work off the processing table, and the landing table would continue to receive the incoming records.
Any additional insights into these options or completely different suggestions are very welcome.
Look into table streams, which record insertions/updates/deletions to your Snowpipe table. You can then merge off the stream into your target table, which resets the stream's offset. Use a task to run your MERGE statement. Also, given it is Snowpipe, when creating your stream it is probably best to use an append-only stream.
However, I had a question here where, in some circumstances, we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason. I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
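For reference, a minimal sketch of that setup; the names (LANDING, LANDING_STREAM, FINAL, MERGE_TASK, MY_WH), the key column, and the schedule are all hypothetical placeholders:
-- append-only stream over the Snowpipe landing table
CREATE OR REPLACE STREAM LANDING_STREAM ON TABLE LANDING APPEND_ONLY = TRUE;

-- task that merges new rows into the final table; consuming the stream
-- in a successful DML statement advances the stream offset
CREATE OR REPLACE TASK MERGE_TASK
  WAREHOUSE = MY_WH
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('LANDING_STREAM')
AS
MERGE INTO FINAL f
USING LANDING_STREAM s
  ON f.id = s.id
WHEN MATCHED THEN UPDATE SET f.payload = s.payload
WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);

ALTER TASK MERGE_TASK RESUME;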
I am working with a postgres database that is being monitored by icinga2, and one of our monitors is looking at the commit ratio of a database:
select
round(100.*sd.xact_commit/(sd.xact_commit+sd.xact_rollback), 2) AS dcommitratio,
d.datname,
r.rolname AS rolname
FROM pg_stat_database sd
JOIN pg_database d ON (d.oid=sd.datid)
JOIN pg_roles r ON (r.oid=d.datdba)
WHERE sd.xact_commit+sd.xact_rollback<>0;
The problem is that an application recently had a bug (now fixed!) that increased the count of rollbacks considerably, so that the commit ratio is now only 78%, and it is triggering alarms every day.
I could run pg_stat_reset(), but is there a way to clear out these two counters only? I don't want to inadvertently clear out any other necessary stats, like any being used by autovacuum or the query optimizer. Or is pg_stat_reset() considered safe to run?
Unfortunately it is all-or-nothing with resetting PostgreSQL statistics.
But I'd say that your monitoring system is monitoring the wrong thing anyway. Rather than monitoring the absolute values of xact_commit and xact_rollback, you should monitor the changes in the values since the last check.
Otherwise you will not detect a potential problem in a timely fashion: if there have been many months of normal operation, it will take a long time of misbehavior to change the ratio perceptibly.
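A rough sketch of that delta approach, assuming the check is allowed to keep a small snapshot table (stats_snapshot is a hypothetical name) that it refreshes on every run:
-- one-time setup: remember the counters from the previous check
CREATE TABLE IF NOT EXISTS stats_snapshot (
    datname       text PRIMARY KEY,
    xact_commit   bigint,
    xact_rollback bigint,
    taken_at      timestamptz
);

-- per check: commit ratio over the interval since the last check
SELECT sd.datname,
       round(100.0 * (sd.xact_commit - s.xact_commit)
             / nullif((sd.xact_commit - s.xact_commit)
                      + (sd.xact_rollback - s.xact_rollback), 0), 2) AS interval_commit_ratio
FROM pg_stat_database sd
JOIN stats_snapshot s USING (datname);

-- then overwrite the snapshot with the current values
INSERT INTO stats_snapshot
SELECT datname, xact_commit, xact_rollback, now()
FROM pg_stat_database
ON CONFLICT (datname) DO UPDATE
SET xact_commit   = EXCLUDED.xact_commit,
    xact_rollback = EXCLUDED.xact_rollback,
    taken_at      = EXCLUDED.taken_at;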
I've got a large conversion job: 299 GB of JPEG images, already in the database, to be converted into thumbnail equivalents for reporting and bandwidth purposes.
I've written a thread-safe SQLCLR function to do the business of re-sampling the images; lovely job.
Problem is, when I execute it in an UPDATE statement (from the PhotoData field to the ThumbData field), this executes linearly to prevent race conditions, using only one processor to resample the images.
So, how would I best utilise the 12 cores and phat raid setup this database machine has? Is it to use a subquery in the FROM clause of the update statement? Is this all that is required to enable parallelism on this kind of operation?
Anyway, the operation is split into batches, around 4000 images per batch (in a windowed query of about 391k images); this machine has plenty of resources to burn.
Please check the Maximum Degree of Parallelism (MAXDOP) configuration setting on your SQL Server. You can also change the value of MAXDOP.
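For instance, a quick sketch of checking and changing the server-wide setting; the value 8 is purely illustrative, and a single statement can also be hinted with OPTION (MAXDOP n):
-- show the current setting
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';

-- change it (8 is an illustrative value, not a recommendation)
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;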
This link might be useful to you http://www.mssqltips.com/tip.asp?tip=1047
cheers
Could you not split the query into batches and execute each batch separately on a separate connection? SQL Server only uses parallelism in a query when it feels like it; you can stop it, or even encourage it (a little) by changing the 'cost threshold for parallelism' option to 0, but I think it's pretty hit and miss.
One thing that's worth noting is that it decides whether or not to use parallelism only at the time the query is compiled. Also, if the query is compiled at a time when the CPU load is higher, SQL Server is less likely to consider parallelism.
I too recommend the "round-robin" methodology advocated by kragen2uk and onupdatecascade (I'm voting them up). I know I've read something irritating about CLR routines and SQL parallelism, but I forget what it was just now... I think they don't play well together.
The approach I've taken in the past on similar tasks is to set up a table listing each batch of work to be done. Each connection you fire up goes to this table, gets the next batch, marks it as being processed, processes it, updates it as Done, and repeats. This lets you gauge performance, manage scaling, and allow stops and restarts without having to start over, and gives you something to show how complete the task is (let alone show that it's actually doing anything).
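A rough sketch of that pattern, with hypothetical names (WorkBatches, FirstId/LastId ranges); the READPAST/UPDLOCK hints are there so that concurrent connections don't claim the same batch:
CREATE TABLE WorkBatches (
    BatchId  int IDENTITY PRIMARY KEY,
    FirstId  int NOT NULL,
    LastId   int NOT NULL,
    Status   varchar(10) NOT NULL DEFAULT 'Pending'  -- Pending / Processing / Done
);

-- each worker connection repeats this until no pending batches remain:
DECLARE @BatchId int, @FirstId int, @LastId int;

-- claim the next pending batch atomically, skipping rows locked by other workers
UPDATE TOP (1) WorkBatches WITH (READPAST, UPDLOCK, ROWLOCK)
SET Status   = 'Processing',
    @BatchId = BatchId,
    @FirstId = FirstId,
    @LastId  = LastId
WHERE Status = 'Pending';

-- ... run the thumbnail UPDATE for the rows between @FirstId and @LastId here ...

UPDATE WorkBatches SET Status = 'Done' WHERE BatchId = @BatchId;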
Find some criteria to break the set into distinct sub-sets of rows (1-100, 101-200, whatever) and then call your update statement from multiple connections at the same time, where each connection handles one subset of rows in the table. All the connections should run in parallel.
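For example, each connection could run the same UPDATE restricted to its own key range; dbo.Photos, PhotoId, dbo.MakeThumbnail, and the ranges are hypothetical stand-ins for the real table, key, SQLCLR function, and whatever split fits the data:
-- connection 1
UPDATE dbo.Photos
SET ThumbData = dbo.MakeThumbnail(PhotoData)   -- hypothetical SQLCLR function
WHERE PhotoId BETWEEN 1 AND 100000;

-- connection 2, run concurrently on a separate connection
UPDATE dbo.Photos
SET ThumbData = dbo.MakeThumbnail(PhotoData)
WHERE PhotoId BETWEEN 100001 AND 200000;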