flink checkpoint stuck on bad file - apache-flink

I am new to Flink (1.3.2) and have a question; I'm hoping someone can help.
We have an S3 path that Flink monitors for new files:
val avroInputStream_activity = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10000)
I have both internal and external checkpointing enabled. Say a bad file arrives in the path: Flink will retry it several times. I want to move those bad files to an error folder and let the processing continue. However, since the file path is persisted in the checkpoint, when I tried to resume from the external checkpoint (after removing the bad file), it threw the following error because the file could no longer be found.
java.io.IOException: Error opening the Input Split s3a://myfile [0,904]: No such file or directory: s3a://myfile
I have two questions here:
How do people handle exceptions like bad files or records?
Is there a way to skip this bad file and move on from the checkpoint?
Thanks in advance.

Best practice is to keep your job running by catching any exceptions, such as those caused by bad input data. You can then use side outputs to create an output stream containing only the bad records. For example, you might send them to a bucketing file sink for further analysis.
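A minimal sketch of that pattern in the Scala API (Flink 1.3+); the record type, parsing logic, source, and error bucket path below are placeholders for illustration, not anything from your job:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink
import org.apache.flink.util.Collector

case class Activity(userId: String, action: String)

object BadRecordRouting {
  // Tag identifying the side output that carries records we could not parse.
  val badRecords: OutputTag[String] = OutputTag[String]("bad-records")

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in for the monitored file source in the question.
    val rawLines: DataStream[String] = env.socketTextStream("localhost", 9999)

    val parsed = rawLines.process(new ProcessFunction[String, Activity] {
      override def processElement(line: String,
                                  ctx: ProcessFunction[String, Activity]#Context,
                                  out: Collector[Activity]): Unit = {
        try {
          val fields = line.split(",")
          out.collect(Activity(fields(0), fields(1))) // may throw on malformed input
        } catch {
          case _: Exception =>
            // Do not fail the job; route the bad line to the side output instead.
            ctx.output(badRecords, line)
        }
      }
    })

    // Write the bad lines somewhere for later analysis
    // (BucketingSink comes from the flink-connector-filesystem dependency).
    parsed.getSideOutput(badRecords)
      .addSink(new BucketingSink[String]("s3a://my-bucket/error-records"))

    parsed.print()
    env.execute("bad-record-routing-sketch")
  }
}

Any record that fails to parse lands in the side output instead of failing the job, so a corrupt file produces a stream of bad records to inspect rather than a restart loop.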

Related

Snowpipe Issue - Azure data lake storage

We're running into an issue where Snowpipe appears to start ingesting the file before it has been fully written to Azure Data Lake Storage.
It then throws an error - Error parsing the parquet file: Invalid: Parquet file size is 0 bytes.
Here are some stats showing that the file was fully written at 13:59:56 but Snowflake was notified at 13:59:47:
PIPE_RECEIVED_TIME - 2021-08-06 13:59:47.613 -0700
LAST_LOAD_TIME - 2021-08-06 14:00:05.859 -0700
ADLS file last modified time - 13:59:56
Has anyone run into this issue or have any pointers for troubleshooting this?
I have seen something similar once. I was trying to funnel Azure logs into a storage account and have them picked up. However, the built-in process that wrote the logs would create a file, gradually append new log entries to it, and then every hour or so cut over to a new file.
Snowpipe would pick up the file with one log entry (or none), and from there the Azure queue would never send another event for that file, so Snowflake would never query it again.
So I'm wondering if your process is creating the file and then updating it, rather than creating it only once the output is fully ready to write.
If this is the issue and you don't have control over how the file is created, you could try a task that runs COPY INTO on a schedule (rather than a Snowpipe), so that you can restrict the list of files being copied to only those that have finished writing.
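Roughly, the setup could look like the sketch below, driven from Scala over the Snowflake JDBC driver purely for illustration. The account, warehouse, stage, and table names are placeholders, and it assumes the writer puts in-progress files under a name the PATTERN can exclude (e.g. a .tmp suffix that is renamed away on completion), which is not something Snowflake enforces:

import java.sql.DriverManager
import java.util.Properties

object ScheduledCopySetup {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("user", sys.env("SNOWFLAKE_USER"))
    props.put("password", sys.env("SNOWFLAKE_PASSWORD"))
    props.put("warehouse", "LOAD_WH")
    props.put("db", "ANALYTICS")
    props.put("schema", "RAW")

    val conn = DriverManager.getConnection(
      "jdbc:snowflake://myaccount.snowflakecomputing.com", props)
    try {
      val stmt = conn.createStatement()
      // Load every 5 minutes, but only files the writer has finished:
      // the assumption is that in-progress files carry a .tmp suffix that is
      // renamed away on completion, so the PATTERN never matches them.
      stmt.execute(
        """CREATE OR REPLACE TASK raw.load_events
          |  WAREHOUSE = LOAD_WH
          |  SCHEDULE = '5 MINUTE'
          |AS
          |  COPY INTO raw.events
          |    FROM @raw.adls_stage
          |    PATTERN = '.*[.]parquet'
          |    FILE_FORMAT = (TYPE = PARQUET)
          |    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
          |""".stripMargin)
      stmt.execute("ALTER TASK raw.load_events RESUME")
    } finally {
      conn.close()
    }
  }
}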

SSIS: Why does sequence container require TransactionOption Required in order to report task failure?

I have a job that creates files on a network folder from various non-database sources. Within the job, I isolate the various file creation tasks (contained in a sequence container) from the move file task (foreach enumerator) in order to prevent a spider's web of precedence constraints from the various file creation tasks:
Data flow task that contains script component using C# and LDAP to pull data from Active Directory and output it to multiple files
Script Component that downloads files from SFTP (implements WinSCPNET.dll)
Upon successful completion, the sequence container then goes to a foreach file enumerator to move the extracted files to a folder that indicates files are ready for loading - there is no problem here.
However, an intermittent problem arose in production where the AD connection was terminating before the file extract process completed, resulting in partial files (this was not observed in testing, but should have been contemplated - my bad). So, I added a foreach enumerator outside of the sequence container with a failure precedence constraint to delete these partial extract files.
During testing of this fix, I set one of the tasks within the sequence container to report failure. Initially, the sequence container reported success, thus bypassing the delete foreach enumerator. I tried setting the MaximumErrorCount from 0 to 1, but that did not result in the desired behavior change. I then changed the sequence container's TransactionOption from Supported to Required and this appears to have fixed the problem. Now, the job moves files that are completely extracted while deleting files that report an error on extraction.
My question is this: Is there a potential problem going this route? I am unsure as to why this solution works. The documentation online discusses the TransactionOption in the context of a connection to the database. But, in this case there is no connection to the database. I just don't want to release a patch that may have a potential bug that I am not aware of.
Regarding Transactions and Files.
Presume you write your files to disk on NTFS or another file system that supports transactions. Then all file-create and file-save actions are enclosed in one transaction. If the transaction fails due to a task failure, all files created inside the transaction are rolled back, i.e. deleted.
So you get an "all or nothing" behavior for the files, receiving files only if all extractions worked out.
If you store the files on a non-transactional file system, like old FAT, this "all or nothing" behavior no longer works and you will receive a partial set of files; a transaction set on the Sequence Container will have no such effect.

How does logging in managed VMs work?

I'm reading Google's docs on logging in managed VMs, and they're rather thin on detail; I have more questions than answers after reading them:
It says files in /var/log/app_engine/custom_logs are picked up automatically – does this path already exist, or do you also have to mkdir -p it?
Do I have to deal with log rotation/truncation myself?
How large can the files be?
If you write a file ending in .log.json and some bit of it is corrupt, does that break the whole file or will Google pick up the bits that can be read?
Is there a performance benefit/cost to log things this way, over using APIs?
UPDATE: I managed to get logs to show up in the log viewer, but only when logging to files with the .log suffix; whenever I try .log.json they are not picked up and I can't see any errors anywhere. The JSON output seems fine and conforms to the requirement of having one object per line. Does anyone know how to debug this?
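For reference, a minimal sketch of the kind of writer being discussed: it appends one self-contained JSON object per line to a file under /var/log/app_engine/custom_logs. The file name, the field names, and the defensive directory creation are assumptions for illustration; only the one-object-per-line requirement comes from the docs.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import java.time.Instant

object CustomLogWriter {
  private val logDir  = Paths.get("/var/log/app_engine/custom_logs")
  private val logFile = logDir.resolve("app.custom.log.json")

  def log(severity: String, message: String): Unit = {
    // Create the directory defensively, in case it does not already exist on the VM.
    Files.createDirectories(logDir)
    // One self-contained JSON object per line, as the docs require.
    val line =
      s"""{"timestamp":"${Instant.now()}","severity":"$severity","message":"${escape(message)}"}""" + "\n"
    Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }

  private def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  def main(args: Array[String]): Unit =
    log("INFO", "request handled")
}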

Handling file access locks while file is being built

I have a [SQL 2008] SSIS package that takes a CSV text file and moves it to a separate folder. Once it is in this folder, I import the data to SQL. The text file is being automatically generated by an outside program on a periodic schedule. The file is also pretty large, so it takes a while (~10 minutes) for it to be generated.
If I attempt to move this file (using a File System Task) WHILE the file is still being built, I get this error message:
"The process cannot access the file because it is being used by another process."
Which makes sense, since it can't move a file that is being accessed elsewhere. Back in DTS I wrote a custom script that checked over a period of XX seconds whether the file size had increased, but I was wondering how to handle this properly in SSIS. Surely there is a better way to determine whether a file is locked before doing file operations.
I would greatly appreciate any suggestions or comments! Thank you.
Probably, you have found an answer to your question by now. This is for others who might stumble upon this question.
To achieve the functionality that you have described in your question, you can use the File Watcher Task, which is available as a free download from SQLIS.com. Click the link to visit the File Watcher Task download page.
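If you would rather roll the stability check yourself (as you describe doing back in DTS), the idea is to poll until the size stops changing and the file can be locked exclusively, and only then move it. A rough sketch, in Scala purely for illustration (an SSIS Script Task would express the same idea in C#), with placeholder paths and thresholds:

import java.io.{File, RandomAccessFile}
import java.nio.file.{Files, Paths, StandardCopyOption}

object WaitForCompleteFile {
  // Returns true once the file size has stopped changing for `requiredStablePolls`
  // consecutive polls and the file can be locked for exclusive access.
  def waitUntilStable(path: String,
                      pollMillis: Long = 5000,
                      requiredStablePolls: Int = 3,
                      maxPolls: Int = 120): Boolean = {
    val file = new File(path)
    var previous = -1L
    var stable = 0
    var polls = 0
    while (stable < requiredStablePolls && polls < maxPolls) {
      Thread.sleep(pollMillis)
      val current = file.length()
      if (current > 0 && current == previous && canLockExclusively(file)) stable += 1
      else stable = 0
      previous = current
      polls += 1
    }
    stable >= requiredStablePolls
  }

  private def canLockExclusively(file: File): Boolean =
    try {
      val raf = new RandomAccessFile(file, "rw")
      try {
        val lock = raf.getChannel.tryLock()
        if (lock != null) { lock.release(); true } else false
      } finally raf.close()
    } catch { case _: Exception => false } // e.g. "in use by another process"

  def main(args: Array[String]): Unit = {
    val src = "\\\\server\\incoming\\extract.csv" // placeholder paths
    val dst = "\\\\server\\ready\\extract.csv"
    if (waitUntilStable(src))
      Files.move(Paths.get(src), Paths.get(dst), StandardCopyOption.REPLACE_EXISTING)
  }
}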
Hope that helps.

JMS File Store tuning

We are using WebLogic JMS as the JMS provider for our application, with a file store as the persistent store. Is there any mechanism to configure the file store size so that after the file has reached the specified size a new file is generated? Right now all the messages to date are persisted in one single file. I think this will affect the I/O operations that read/write messages as the size increases. Has anyone come across this situation, and what is the solution?
Here are the details about WebLogic JMS file store tuning.
Maybe this info about unit-of-order processing will be helpful as well.
I can't see anything about file size, but perhaps the message paging info will be helpful.
When I've used JMS with persistent messaging, I haven't had a problem with file size. Once a message is transacted, what happens to its entry in the file? Shouldn't it be removed or marked as acknowledged? Surely the file size doesn't grow without bound.
