Process past data in multiple window operators in Apache Flink?

Context: the project I'm working on processes timestamped files that are produced periodically (every minute) and ingested in real time into a series of cascading window operators. The timestamp of the file indicates the event time, so I don't need to rely on the file creation time. The result of each window's processing is sent to a sink which stores the data in several tables.
input -> 1 min -> 5 min -> 15 min -> ...
         \-> SQL   \-> SQL    \-> SQL
I am trying to come up with a solution to deal with possible downtime of the real time process. The input files are generated independently, so in case of severe downtime of the Flink solution, I want to ingest and process the missed files as if they were ingested by the same process.
My first thought is to configure a mode of operation of the same flow which reads only the missed files and uses an allowed lateness large enough to cover the earliest file to be processed. However, once a file has been processed, it is guaranteed that no more late files will be ingested, so I don't necessarily need to keep the earliest window open for the duration of the whole backfill, especially since there may be many files to process in this manner. Is it possible to close windows earlier even with allowed lateness set, or should I instead read the whole thing as a batch operation and partition by timestamp?

Since you are ingesting the input files in order, using event time processing, I don't see why there's an issue. When the Flink job recovers, it seems like it should be able to resume from where it left off.
If I've misunderstood the situation, and you sometimes need to go back and process (or reprocess) a file from some point in the past, one way to do this would be to deploy another instance of the same job, configured to only ingest the file(s) needing to be (re)ingested. There shouldn't be any need to rewrite this as a batch job -- most streaming jobs can be run on bounded inputs. And with event time processing, this backfill job will produce the same results as if it had been run in (near) real-time.
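For illustration, a minimal sketch of what such a backfill job might look like, assuming the missed files are newline-delimited records of the form <key>,<epochMillis>,... sitting under one directory (the path, the record format, and the print() sink are placeholders for the real layout and the real SQL sink):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class BackfillJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read only the missed files; readTextFile produces a bounded stream,
        // so the job finishes once every file has been processed.
        DataStream<Tuple2<String, Long>> records = env
                .readTextFile("file:///data/backfill/")           // hypothetical location of the missed files
                .map(line -> {
                    String[] parts = line.split(",");             // hypothetical format: <key>,<epochMillis>,...
                    return Tuple2.of(parts[0], Long.parseLong(parts[1]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

        // Event time comes from the records, not from processing time, so the
        // windows fire exactly as they would have in the live job.
        records.assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((record, ts) -> record.f1))
                .keyBy(r -> r.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .sum(1)                                           // stand-in for the real window logic
                .print();                                         // stand-in for the real SQL sink

        env.execute("backfill");
    }
}
```

Because the timestamps are taken from the records rather than from processing time, the windows fire just as they would have during live ingestion, and the job simply terminates once the bounded input is exhausted.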

Related

How to handle Sagemaker Batch Transform discarding a file with a failed model request

I have a large number of JSON requests for a model split across multiple files in an S3 bucket. I would like to use Sagemaker's Batch Transform feature to process all of these requests (I have done a couple of test runs using small amounts of data and the transform job succeeds). My main issue is here (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html#batch-transform-errors), specifically:
If a batch transform job fails to process an input file because of a problem with the dataset, SageMaker marks the job as failed. If an input file contains a bad record, the transform job doesn't create an output file for that input file because doing so prevents it from maintaining the same order in the transformed data as in the input file. When your dataset has multiple input files, a transform job continues to process input files even if it fails to process one. The processed files still generate useable results.
This is not preferable, mainly because if one request fails (whether it's a transient error, a malformed request, or something wrong with the model container) in a file with a large number of requests, all of those requests get discarded (even if every other request succeeded and only the last one failed). I would ideally prefer Sagemaker to just write the output of the failed response to the file and keep going, rather than discarding the entire file.
My question is, are there any suggestions for mitigating this issue? I was thinking about storing one request per file in S3, but this seems somewhat ridiculous. Even if I did this, is there a good way of seeing which specific requests failed after the transform job finishes?
You've got the right idea: the fewer datapoints in each file, the less likely a given file is to fail. The issue is that while you can pass a prefix with many files to CreateTransformJob, partitioning one datapoint per file requires at least an S3 read per datapoint, plus a model invocation per datapoint, which is probably not great. Be aware also that apparently there are hidden rate limits.
Here are a couple options:
Partition into small-ish files, and plan on failures being rare. Hopefully, not many of your datapoints would actually fail. If you partition your dataset into e.g. 100 files, then a single failure only requires reprocessing 1% of your data. Note that Sagemaker has built-in retries, too, so most of the time failures should be caused by your data/logic, not randomness on Sagemaker's side.
Deal with failures directly in your model. The same doc you quoted in your question also says:
If you are using your own algorithms, you can use placeholder text, such as ERROR, when the algorithm finds a bad record in an input file. For example, if the last record in a dataset is bad, the algorithm places the placeholder text for that record in the output file.
Note that the reason Batch Transform does this whole-file failure is to maintain a 1-1 mapping between rows in the input and the output. If you can substitute the output for failed datapoints with an error message from inside your model, without actually causing the model itself to fail processing, Batch Transform will be happy.
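As a concrete illustration of that approach, here is a minimal sketch of per-record error handling inside the container's inference code; the record format, the scoring logic, and the surrounding serving framework are all assumptions, not SageMaker APIs:

```java
import java.util.ArrayList;
import java.util.List;

public class PlaceholderTransform {

    // Hypothetical per-record scoring logic; throws on a bad record.
    static String score(String record) {
        if (record.trim().isEmpty()) {
            throw new IllegalArgumentException("bad record");
        }
        return "{\"prediction\": " + record.length() + "}";   // stand-in for the real model output
    }

    // Transform a whole input file: a failed record becomes an ERROR placeholder
    // instead of failing the file, so input and output stay aligned row for row.
    static List<String> transformFile(List<String> inputRecords) {
        List<String> output = new ArrayList<>();
        for (String record : inputRecords) {
            try {
                output.add(score(record));
            } catch (Exception e) {
                output.add("ERROR");   // placeholder keeps the 1-1 input/output mapping
            }
        }
        return output;
    }
}
```

After the transform job finishes, the failed datapoints can be located simply by searching the output files for the placeholder value, which also answers the question of how to see which requests failed.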

Are there best practices for deduplicating records when using auto-ingest Snowpipes?

Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is around how best to safely perform this MERGE without missing any records. At the moment we are performing a single data extraction job per day, so there is normally a point where the Snowpipe queue is empty, which we use as an indicator that it is safe to proceed. However, we are looking to move to more frequent extractions, where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work, but it is not clear to me that this is an advised way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link); however, we're talking about pausing for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is that if the MERGE and DELETE could be contained in a transaction, it might provide a safe way to process the incoming data throughout the day, but I'm not sure if that's true.
Add a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At that point the MERGE and TRUNCATE would work off the processing table while the landing table continues to receive the incoming records.
Any additional insights into these options or completely different suggestions are very welcome.
Look into table streams, which record insertions/updates/deletions on your Snowpipe landing table. You can then merge off the stream into your target table, which resets the stream's offset. Use a task to run your merge statement. Also, given that it is Snowpipe (insert-only), when creating your stream it is probably best to use an append-only stream.
However, I had a question here where, in some circumstances, we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason. I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
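For reference, a rough sketch of the stream-plus-transactional-MERGE pattern described above. Purely for illustration the statements are issued over the Snowflake JDBC driver rather than from a stored procedure called by a task, and every object name (LANDING_TABLE, LANDING_STREAM, FINAL_TABLE, the account URL, the credentials) is a placeholder:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class MergeFromStream {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "ETL_USER");                                   // placeholder credentials
        props.put("password", System.getenv("SNOWFLAKE_PASSWORD"));
        props.put("warehouse", "ETL_WH");
        props.put("db", "RAW");
        props.put("schema", "LANDING");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);  // placeholder account
             Statement stmt = conn.createStatement()) {

            // One-time setup: an append-only stream on the Snowpipe landing table,
            // which only records inserts (all Snowpipe ever does is insert).
            stmt.execute("CREATE STREAM IF NOT EXISTS LANDING_STREAM "
                    + "ON TABLE LANDING_TABLE APPEND_ONLY = TRUE");

            // Consume the stream inside an explicit transaction: the stream offset
            // only advances when the transaction commits, so a failed merge does
            // not lose rows.
            stmt.execute("BEGIN");
            stmt.executeQuery("SELECT COUNT(*) FROM LANDING_STREAM");    // the extra SELECT the answer found helpful
            stmt.execute("MERGE INTO FINAL_TABLE t "
                    + "USING (SELECT * FROM LANDING_STREAM) s ON t.id = s.id "
                    + "WHEN MATCHED THEN UPDATE SET t.payload = s.payload "
                    + "WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)");
            stmt.execute("COMMIT");
        }
    }
}
```

The key property is that the stream's offset only advances when the transaction that consumed it commits, so a failed or rolled-back MERGE does not lose the rows that were in the stream.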

Order of operations between timewindow to sink pipeline

Suppose I have a flink pipeline as such:
kafka_source -> maps/filters/keyBy/timewindow(1 minute) -> sinkCassandra
By the time the grouped messages hit the sinkCassandra operation, am I guaranteed that no other slot is concurrently running the maps/filters/keyBy/timewindow(1 minute) part of the pipeline?
Or is it possible to have some other slot run the middle pipeline while another set is running the sinkCassandra operation?
EDIT (added more requirements based on the comment conversation):
What I'm effectively trying to do is look up the current value for each Flink data key in the datastore, update it, and flush the updated data back.
The reason I'm avoiding kafka_source -> maps/filters -> keyBy/TimeWindow/statefulReduce -> sinkCassandra is that the state can potentially get huge (1 day to 7 days, where I can set 7 days as the maximum time bound) and I don't necessarily know the time window for each key. This would mean a HUGE state even with RocksDB.
Another potential option that I'm looking at is kafka_source -> maps/filters -> keyBy/sinkCass, where within the custom sink operation I would first check some sort of in-memory buffer for the key I want to update. If it's not there, I go ahead and fetch it from Cassandra. Every 5 seconds (or every N seconds), I would grab whatever's in the buffer and flush it into Cassandra. To limit memory, I could use an in-memory least-recently-used hashmap (I don't necessarily want to evict entries just because they've been flushed, because the same keys will show up again!).
Unless you have explicitly configured something unusual, each slot will contain one parallel slice of the complete pipeline -- each slot will have a kafka source instance connected to a disjoint subset of the kafka partitions, as well as the maps/filters/keyBy/window, and the cassandra sink.
All of those parallel sub-pipelines (slots) will be running concurrently. Furthermore, within each slot, each of the operators will also be running concurrently. The sink and the middle part of your pipeline are already running concurrently, but they are competing for the resources of the slot that contains them both. You can configure your task managers to have more cores per slot if you are concerned about starvation.
EDIT (responding to the additional info about requirements):
You can safely assume that for any given flink data key, after a keyBy, only one instance of each operator will process events for that key. That principle is fundamental to Flink's design. If I understand correctly what you are contemplating, that's the only guarantee you need.
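Building on that per-key guarantee, here is a rough sketch of the buffered-sink idea from the edit, assuming Tuple2<String, Long> records keyed on f0; the flush policy, the LRU capacity, and the stubbed Cassandra calls are all assumptions:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of the buffered-sink idea from the question. Because it sits
 * behind a keyBy, each parallel instance only ever sees its own subset of keys,
 * so the in-memory buffer never races with another instance for the same key.
 * The Cassandra calls are stubs; the names and the flush policy are assumptions.
 */
public class BufferedCassandraSink extends RichSinkFunction<Tuple2<String, Long>> {

    private static final int MAX_ENTRIES = 10_000;         // LRU capacity per parallel instance
    private static final long FLUSH_INTERVAL_MS = 5_000;   // flush "every N seconds"

    private transient Map<String, Long> buffer;
    private transient long lastFlush;

    @Override
    public void open(Configuration parameters) {
        // Access-ordered LinkedHashMap evicts the least recently used key when full.
        buffer = new LinkedHashMap<>(MAX_ENTRIES, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                if (size() > MAX_ENTRIES) {
                    writeToCassandra(eldest.getKey(), eldest.getValue());  // flush before evicting
                    return true;
                }
                return false;
            }
        };
        lastFlush = System.currentTimeMillis();
    }

    @Override
    public void invoke(Tuple2<String, Long> record, Context context) {
        // Look up the current value: buffer first, Cassandra on a miss.
        Long current = buffer.containsKey(record.f0)
                ? buffer.get(record.f0)
                : readFromCassandra(record.f0);
        buffer.put(record.f0, current + record.f1);          // hypothetical "update" logic

        if (System.currentTimeMillis() - lastFlush >= FLUSH_INTERVAL_MS) {
            buffer.forEach(this::writeToCassandra);          // keep entries so repeated keys stay hot
            lastFlush = System.currentTimeMillis();
        }
    }

    // Stubs: replace with real Cassandra driver calls.
    private Long readFromCassandra(String key) { return 0L; }
    private void writeToCassandra(String key, Long value) { /* UPDATE ... WHERE key = ? */ }
}
```

It would be attached after the keyBy, e.g. stream.keyBy(r -> r.f0).addSink(new BufferedCassandraSink()), so each parallel sink instance owns a disjoint set of keys. Keep in mind that a plain in-memory buffer like this is not checkpointed, so updates that have not yet been flushed can be lost on failure.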

How to execute functions on empty windows in Flink streaming?

I wrote a Flink program that calculates the number of events per keyed window from a simple Kafka stream. It works great, fast and accurate. When the source stops, I would like to have 0 as the result of the calculation for each window, but no result is sent. The function just does not execute. I assume this is because of Flink's lazy evaluation behavior.
Any recommendation?
I encountered the same situation. Filling the holes in your database with another process is a solution.
However, I found it easier to union the main stream with a custom periodic source that emits dummies, whose only role is to trigger window creation. When doing this, you have to make sure that the dummies are ignored in the computations.
Here is how to code such a periodic source (you may not need a RichParallelSourceFunction; a plain SourceFunction can be enough).
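A minimal sketch of such a source follows; the String element type, the one-minute interval, and the "DUMMY" marker value are assumptions, and the window function downstream has to recognise and skip the marker:

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

/**
 * Emits one dummy element per interval so that downstream windows are created
 * even when the real source is silent. The dummy must be filtered out of (or
 * ignored by) the window computation itself.
 */
public class PeriodicDummySource implements SourceFunction<String> {

    private static final long INTERVAL_MS = 60_000;   // assumed one-minute period
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect("DUMMY");                 // marker value the window function must ignore
            }
            Thread.sleep(INTERVAL_MS);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

It would then be unioned with the main stream ahead of the windowing, e.g. mainStream.union(env.addSource(new PeriodicDummySource())), assuming both streams carry the same element type.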

SSIS processing large amount of flat files is painfully slow

From one of our partners, I receive about 10,000 small tab-delimited text files with roughly 30 records in each file. It is impossible for them to deliver it all in one big file.
I process these files in a Foreach Loop container. After reading a file, 4 column derivations are performed and the contents are finally stored in a SQL Server 2012 table.
This process can take up to two hours.
I already tried combining the small files into one big file and then importing that one into the same table. That process takes even more time.
Does anyone have any suggestions to speed up processing?
One thing that sounds counterintuitive is to replace your one Derived Column Transformation with 4, and have each one perform a single task. The reason this can provide a performance improvement is that the engine can better parallelize operations if it can determine that these changes are independent.
Investigation: Can different combinations of components affect Dataflow performance?
Increasing Throughput of Pipelines by Splitting Synchronous Transformations into Multiple Tasks
You might be running into network latency since you are referencing files on a remote server. Perhaps you can improve performance by copying those remote files to the local box before you begin processing. The performance counters you'd be interested in are:
Network Interface / Current Bandwidth
Network Interface / Bytes Total / sec
Network Interface / Transfers/sec
The other thing you can do is replace your destination and Derived Column with a Row Count transformation. Run the package a few times for all the files; that will determine your theoretical maximum speed. You won't be able to go any faster than that. Then add your Derived Column back in and re-run. That should help you understand whether the drop in performance is due to the destination, the derived column operation, or whether the package is simply running as fast as the IO subsystem can go.
Do your files offer an easy way (i.e. their names) of subdividing them into even (or mostly even) groups? If so, you could run your loads in parallel.
For example, let's say you could divide them into 4 groups of 2,500 files each.
Create a Foreach Loop container for each group.
For your destination for each group, write your records to their own staging table.
Combine all records from all staging tables into your big table at the end.
If the files themselves don't offer an easy way to group them, consider pushing them into subfolders when your partner sends them over, or inserting the file paths into a database so you can write a query to subdivide them and use the file path field as a variable in the Data Flow task.
