Snowpipe Issue - Azure Data Lake Storage

We're running into an issue where Snowpipe appears to start ingesting a file even before it has been fully written to Azure Data Lake Storage.
It then throws the error: Error parsing the parquet file: Invalid: Parquet file size is 0 bytes.
Here are some stats showing that the file was fully written at 13:59:56, yet Snowflake was notified at 13:59:47, nine seconds earlier.
PIPE_RECEIVED_TIME - 2021-08-06 13:59:47.613 -0700
LAST_LOAD_TIME - 2021-08-06 14:00:05.859 -0700
ADLS file last modified time - 13:59:56
Has anyone run into this issue, or does anyone have pointers for troubleshooting it?

I have seen something similar once. I was trying to funnel Azure logs into a storage account and have them picked up. However, the built-in process that wrote the logs would create a file, gradually append new log entries to it, and then cut over to a new file every hour or so.
Snowpipe would pick up the file with one log entry (or none), and from then on the Azure queue would never send another event for that file, so Snowflake would never process it again.
So I'm wondering whether your process is creating the file and then updating it, rather than writing it out in full in a single operation.
If this is the issue and you don't have control over how the file is created, you could try using a task that runs COPY INTO on a schedule (rather than a Snowpipe), so that you can restrict the list of files being copied to just those that have finished writing.
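For illustration, here is a minimal Python sketch of that approach, assuming an external stage named @raw_stage over the ADLS container, a target table RAW_EVENTS, and a five-minute grace period as the definition of "finished writing"; the connection details and stage URL are placeholders, and the same logic could instead live in a Snowflake TASK written in SQL.

```python
import datetime as dt
from email.utils import parsedate_to_datetime

import snowflake.connector

GRACE = dt.timedelta(minutes=5)  # assumed "fully written" threshold
# LIST on an external stage returns fully qualified URLs, while COPY's FILES
# clause expects paths relative to the stage, so we trim this prefix (placeholder).
STAGE_URL_PREFIX = "azure://myaccount.blob.core.windows.net/mycontainer/"

conn = snowflake.connector.connect(
    account="myaccount", user="loader", password="...",   # placeholders
    warehouse="LOAD_WH", database="MYDB", schema="MYSCHEMA",
)
cur = conn.cursor()

# LIST returns name, size, md5, last_modified for every file on the stage.
cur.execute("LIST @raw_stage")
now = dt.datetime.now(dt.timezone.utc)
finished = []
for name, size, md5, last_modified in cur.fetchall():
    if now - parsedate_to_datetime(last_modified) >= GRACE:
        finished.append(name.replace(STAGE_URL_PREFIX, "", 1))

# Copy only the files that stopped changing at least GRACE ago; the COPY
# load history still prevents double-loading files from earlier runs.
if finished:
    files_clause = ", ".join(f"'{f}'" for f in finished)
    cur.execute(
        f"COPY INTO RAW_EVENTS FROM @raw_stage FILES = ({files_clause}) "
        "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
conn.close()
```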

Related

StreamingFileSink fails to start if an MPU is missing in S3

We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after a proper stopping of the application it will fail to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
Looking at the code, I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as MPU uploads, or as _tmp_ objects when the upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
StreamingFileSink will fail with the error above when an MPU that it has in state no longer exists in S3.
Would the following scenario be possible?
1. A checkpoint is taken and files are transitioned into the pending state.
2. notifyCheckpointComplete is called and starts to complete the MPUs.
3. The application is suddenly killed, or even just stopped as part of a shutdown.
4. The checkpointed state still has information about the MPUs, but restoring from it won't find them, because they were completed outside of the checkpoint and are not part of the state.
Would it be better to ignore missing MPUs and _tmp_ files, or to make that an option?
That way the situation above could not happen, and it would be possible to restore from arbitrary checkpoints/savepoints.
The Streaming File Sink in Flink has been superseded by the new File Sink implementation since Flink 1.12, which uses the unified Sink API for both batch and streaming use cases. I don't think the problem you've described here would occur when using that implementation.

Snowpipe: Importing historical data that may be modified without causing re-imports

To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have an S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't re-import files that have been modified" aspect is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to import files staged no earlier than seven days ago, and the only other recommendation Snowflake's documentation has for importing historical data is to use COPY INTO .... If I use that, then when those old files later get modified they get imported again, since the metadata that prevents COPY INTO ... from re-importing the S3 files and the metadata Snowpipe uses are separate, so I can end up with the same file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe with this?
If you're not opposed to a scripting solution for this, one option would be to write a script that pulls the set of in-scope object names from AWS S3 and feeds them to the Snowpipe REST API. The code you'd use for this is very similar to what is required if you're using an AWS Lambda to call the Snowpipe REST API when triggered by an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just use Snowflake's LIST command against the stage to pull them.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we enabled Snowpipe ingestion after data had already been written there. Even in scenarios where you don't have to worry about a file being updated in place, this can still be an advantage over falling back to a direct COPY INTO, because you don't have to worry about any overlap between the files the pipe has already loaded since it was enabled and the set of files you push to the Snowpipe REST API; the pipe's load history takes care of that for you.
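As a rough sketch of that approach (bucket, prefix, pipe name, and key path below are all placeholders), the script boils down to listing the in-scope objects with the AWS SDK and pushing them to the insertFiles endpoint via the snowflake-ingest Python package:

```python
import boto3
from snowflake.ingest import SimpleIngestManager, StagedFile

BUCKET = "my-bucket"                 # placeholder
PREFIX = "historical/2019/"          # placeholder: the in-scope objects

# 1. Collect the object keys to backfill (Snowflake's LIST on the stage
#    would work here as well).
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])

# 2. Push them to the Snowpipe REST API. Paths passed to the API should be
#    relative to the pipe's stage location, so trim any stage prefix if needed.
with open("rsa_key.p8") as f:        # key-pair auth, placeholder path
    private_key = f.read()

ingest = SimpleIngestManager(
    account="myaccount",             # placeholder
    host="myaccount.snowflakecomputing.com",
    user="INGEST_USER",
    pipe="MYDB.MYSCHEMA.MY_PIPE",
    private_key=private_key,
)

# insertFiles accepts batches, so chunk the list; the pipe's load history
# makes resubmitting an already-loaded file a no-op.
BATCH = 500
for i in range(0, len(keys), BATCH):
    resp = ingest.ingest_files([StagedFile(k, None) for k in keys[i:i + BATCH]])
    print(resp["responseCode"])
```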

Automatically retain latest datastore backup

I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fixed location for each kind where the most recent backup_info file can be found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard environment) with data in Cloud Datastore. I can run a cron job to perform backups automatically and regularly, as described in the Scheduled Backups docs, and I can also write a bit of Python code to execute backup tasks manually, as described in this SO answer. I plan to write a small Python cron job that finds the most recent backup_info file for a given kind and copies/renames it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders over the course of a day, especially if there is more than one backup for a certain kind. For example, in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume this file contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
One of these exists for every kind in the backup, e.g. "Comment". It seems to contain a list of the actual backup data files for that kind and that backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
A data folder exists for each backup and kind; here, kind "Comment" was backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
If (1) and (2) are not possible: which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.
I have found two solutions to the problem: one is in GA and one is in beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup, and its API for long-running operations makes it possible to get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events makes it possible to handle just the specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). A job lets you specify the path, and it produces predictable full paths, so existing backup files can be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). The format of the export also differs from the old one, but it can be imported into BigQuery too, just like the old [KIND].backup_info files.
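For example, a cron handler could call the export API roughly like this (a hedged sketch using the Google API discovery client; project, bucket, and kind are placeholders, and the field names should be checked against the Datastore Admin API reference):

```python
from googleapiclient.discovery import build

PROJECT = "my-project"                                   # placeholder
OUTPUT_PREFIX = "gs://my-project.appspot.com/latest"     # fixed, predictable path

datastore = build("datastore", "v1")

# Start an export of just the kinds we care about into the fixed location.
operation = datastore.projects().export(
    projectId=PROJECT,
    body={
        "outputUrlPrefix": OUTPUT_PREFIX,
        "entityFilter": {"kinds": ["Comment"]},
    },
).execute()

# Export is a long-running operation; polling it exposes the output location,
# which is useful if the prefix contains a timestamp instead of being fixed.
op = datastore.projects().operations().get(name=operation["name"]).execute()
print(op.get("metadata", {}).get("outputUrlPrefix"))
```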
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket. If the file exists (file.resourceState === 'exists' rather than 'not_exists', which would indicate a deletion), is new (file.metageneration === '1') and is in fact one of the [KIND].backup_info files we want, it is copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function, so we don't accidentally overwrite a more recent backup file with an older one. Note that copying or moving the actual backup payload would break the references inside the [KIND].backup_info files, though.
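A sketch of the same idea as a 1st-gen background Cloud Function, written here in Python for consistency with the other snippets (the original suggestion uses Node.js); the bucket names and the kind list are placeholders:

```python
from google.cloud import storage

LATEST_BUCKET = "my-project-latest-backups"   # placeholder destination bucket
KINDS = {"Comment"}                           # kinds we want to track

client = storage.Client()

def copy_latest_backup_info(event, context):
    """Triggered on google.storage.object.finalize in the backup bucket."""
    if event.get("metageneration") != "1":
        return                                # only newly created objects
    name = event["name"]                      # e.g. "<random-id>.Comment.backup_info"
    parts = name.split(".")
    if len(parts) != 3 or parts[2] != "backup_info" or parts[1] not in KINDS:
        return                                # not a per-kind backup_info file

    kind = parts[1]
    src_bucket = client.bucket(event["bucket"])
    dst_bucket = client.bucket(LATEST_BUCKET)
    dst_name = f"latest/{kind}.backup_info"

    # Don't overwrite a more recent copy with an older file: compare the
    # source creation time stored as custom metadata on the previous copy.
    existing = dst_bucket.get_blob(dst_name)
    if existing and (existing.metadata or {}).get("source_time_created", "") >= event["timeCreated"]:
        return

    copied = src_bucket.copy_blob(src_bucket.blob(name), dst_bucket, dst_name)
    copied.metadata = {"source_time_created": event["timeCreated"]}
    copied.patch()
```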
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)

How to Detect Errors in Apex Data Loader Batch Execution

We have a DOS Batch job which runs a multi-step process to:
Delete all records from Salesforce for a specific object (download the IDs and then delete them using Data Loader).
Delete all records from a database table which mirrors the Salesforce data.
Extract data from a database and upload it to the Salesforce objects using Data Loader.
Download the Salesforce data into the database table.
Recently, the first step has been failing with a QUERY-TIMEOUT error. If I rerun the process, it generally works OK without any other changes. This is being investigated, but is not my question.
My question is: How can I detect when step 1 (which uses Data Loader) in the batch file fails? If this fails, I do not want to proceed with the rest of the process, as this deletes the database data which is used elsewhere for reporting.
Does the Apex Data Loader set an ERRORLEVEL if it fails? How else can I determine that there was a failure?
Thanks.
Ron Ventura
For more detail, please refer to the link below. Basically, the approach is to check the error log file that Data Loader generates: if the run is 100% successful, the error log will contain only a header line and no data rows.
https://www.nimbleuser.com/blog/failing-safe-with-the-apex-data-loader-for-salesforce-crm
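For example, a small checker along these lines (a sketch; the error-file path is hypothetical and depends on how your command-line configuration names the output files) can run right after step 1 and give the DOS batch file an exit code to test:

```python
import csv
import sys

ERROR_LOG = r"C:\dataloader\logs\error_delete_step1.csv"   # hypothetical path

with open(ERROR_LOG, newline="", encoding="utf-8-sig") as f:
    rows = list(csv.reader(f))

# A fully successful run leaves only the header row in the error file;
# anything beyond that is a record that failed.
if len(rows) > 1:
    print(f"{len(rows) - 1} record(s) failed in step 1 - aborting.")
    sys.exit(1)

print("Step 1 completed with no errors.")
```

The batch file can then run the checker and follow it with IF ERRORLEVEL 1 GOTO abort, so the later delete/upload steps never run after a failed step 1.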
You can also refer to this answer:
https://salesforce.stackexchange.com/questions/14466/availability-of-apex-data-loader-error-file-from-local-pc-to-salesforce
Regards.

File created from base64 string causes memory leak on server

The company that I work for has come across a pretty significant issue with one of our releases that has brought our project to a screeching halt.
A third-party application that we manage generates Word documents from base64-encoded strings stored in our SQL Server. The issue we are having is that in some cases, when one of these documents is sent via SMTP and the file is opened by the user, the file fails to open.
When the file fails, the server locks up. Memory and CPU usage on the server then grow rapidly, to the point that the only option is to kill the process from the server side in order to prevent failure and downtime for the rest of the users on the network.
We are using Windows 7 with Microsoft Office 2013 and the latest version of SQL Server.
What is apparent is that the Word document created from the base64 string is corrupt. What isn't apparent is how this appears to bring the entire server down in one fell swoop.
Has anyone come across this issue before, and if so, what was the solution you came up with? We do not have access to the binaries of the third-party application that generates the files. We aren't able to reproduce the issue manually in order to come up with a working test case to present to the third party, so we are stumped. Any ideas?
I would need more details to understand your scenario. You say this is the order of events:
1. Word file is sent via SMTP (presumably an email to an Outlook client)
2. User receives email; opens attached file
3. Memory and CPU on server go to 100 percent. This creates downtime for rest of the users.
4. Need to kill this process to recover.
Since Outlook is a client-side application, it must be the Word document attachment that is causing this problem. Can you post a sample document in a public place, like a free OneDrive account? Presumably this document creates the problem. Maybe it has some VBA code? Try this with a blank document.
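Not something suggested in the thread, but as a hedged illustration of how the suspected corruption could be detected before an email ever goes out: a .docx file is a ZIP archive, so decoding the stored base64 string and probing it with Python's zipfile module will flag zero-length or truncated documents.

```python
import base64
import binascii
import io
import zipfile

def looks_like_valid_docx(b64_string: str) -> bool:
    """Return True if the decoded payload is a readable .docx (ZIP) file."""
    try:
        raw = base64.b64decode(b64_string, validate=True)
    except (binascii.Error, ValueError):
        return False                      # not valid base64 at all
    if not raw:
        return False                      # empty payload -> 0-byte document
    try:
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            # testzip() checks the CRCs; a real .docx contains word/document.xml
            return zf.testzip() is None and "word/document.xml" in zf.namelist()
    except zipfile.BadZipFile:
        return False
```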
