We are using Flink to process input events, aggregate them, and write the output of our streaming job to S3 using StreamingFileSink, but whenever we try to restore the job from a savepoint, the restoration fails with a missing part files error. As per my understanding, S3 deletes those intermediate part files, so they can no longer be found on S3. Is there a workaround for this, so that we can use S3 as a sink?
I have deployed my own Flink setup on AWS ECS: one service for the JobManager and one service for the TaskManagers. I am running one ECS task for the JobManager and 3 ECS tasks for the TaskManagers.
I have a batch-style job which I upload via the Flink REST API every day with new arguments. Each time I submit it, disk usage grows by ~600 MB. Checkpoints are written to S3, and I have also set historyserver.archive.clean-expired-jobs to true.
Since I am running on ECS, I am not able to figure out why usage increases on every jar upload and execution.
Which Flink config parameters should I look at to make sure usage does not keep growing on every new job upload?
Try these configuration options.
blob.service.cleanup.interval:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#blob-service-cleanup-interval
historyserver.archive.retained-jobs:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#historyserver-archive-retained-jobs
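For example, a minimal flink-conf.yaml sketch using those two options (the values here are illustrative, not recommendations):
# how often (in seconds) unreferenced BLOBs such as uploaded job jars are cleaned up
blob.service.cleanup.interval: 3600
# cap the number of archived jobs the HistoryServer retains (default -1 = unlimited)
historyserver.archive.retained-jobs: 50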
We want to ingest events into Snowflake from an S3 bucket. I know this is possible from this documentation: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html
But after the files have been ingested, we'd like to either delete them or move them out of the ingest bucket. The desired flow:
1: Events loaded into ingest bucket via Firehose (or direct)
2: Snowflake automatically ingests events from this bucket
3: Snowflake either A) moves the files to a new bucket or B) sends a message (SNS?) to a process (lambda) to move the processed file out of the ingest bucket.
Is this possible with Snowflake? I'd like to use the automatic ingestion feature, but it looks like the only way to get this behavior would be to write the pipeline ourselves.
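For reference, the auto-ingest pipe that documentation describes looks roughly like the sketch below (the stage, pipe, table names, and bucket path are placeholders); note that the pipe itself only loads the files, so the move/delete step would still have to happen outside Snowflake:
-- hypothetical names throughout
create stage ingest_stage
  url = 's3://my-ingest-bucket/events/'
  credentials = (aws_key_id = '...' aws_secret_key = '...');

create pipe events_pipe auto_ingest = true as
  copy into events
  from @ingest_stage
  file_format = (type = 'json');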
I have created a Snowpipe for one of my Snowflake tables. Source files land in an AWS S3 bucket periodically, so I followed the steps below to create the Snowpipe:
Created an external stage
Queried the files using the "PUT" command (able to see the list of available files in the results panel)
Created the Snowpipe
Configured the SQS notification on the S3 bucket
Added one sample file, but it was not loaded automatically
Altered the Snowpipe using the following command:
alter pipe snowpipe_content refresh;
The file was then loaded into the Snowflake target table after some time.
Can someone please help me figure out what I missed in the Snowpipe setup?
Follow the steps below to troubleshoot your Snowpipe:
Step I: Check the status of your Snowpipe:
SELECT SYSTEM$PIPE_STATUS('pipe_name');
Make sure your pipe status is RUNNING
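If the status comes back as something other than RUNNING because the pipe has been paused, it can be resumed, for example (pipe name as in the question):
alter pipe snowpipe_content set pipe_execution_paused = false;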
Step II: Check the copy history for the table associated with the Snowpipe:
select *
from table(information_schema.copy_history(table_name=>'table_name', start_time=> dateadd(hours, -1, current_timestamp())));
Check whether your file appears in this list and whether it loaded or errored.
Step III: Validate your Snowpipe load:
select *
from table(validate_pipe_load(
  pipe_name=>'pipe_name',
  start_time=>dateadd(hour, -1, current_timestamp())));
If the above steps look good, the issue might be with your SQS notification setup.
Refer to the Snowflake knowledge base article at the link below:
Snowflake KB
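One quick check worth adding: confirm that the S3 bucket's event notification targets the SQS queue ARN that Snowflake created for the pipe, which you can look up with (pipe name as in the question):
show pipes like 'snowpipe_content';
-- or
desc pipe snowpipe_content;
-- compare the notification_channel value against the SQS ARN configured on the bucket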
Just playing around with Snowpipe. I had it working: I would drop a file onto S3 and Snowpipe would load the data into a Snowflake table.
However, when I copied the same file into the S3 bucket a second time, Snowpipe didn't pick it up, nor did it pick up any subsequent files that were not duplicates.
To illustrate:
Uploaded file1.txt into the S3 bucket - success
Uploaded file2.txt into the S3 bucket - success
Uploaded file3.txt into the S3 bucket - success
Re-Uploaded file1.txt into the S3 bucket - no result - table was not updated
Uploaded file4.txt into the S3 bucket - no result - table was not updated
How do I go about troubleshooting or fixing this issue?
Thanks
A few clarifications:
1) Yes, Snowpipe will not load a file again. If there is an error in the file and you need to modify it, you will need to rename it (e.g. file1v2.txt).
2) The behavior you noticed regarding the next file not being loaded is unexpected and requires troubleshooting. Is there any issue with the next file (since it is showing up as a pending file count of 1)? Are you able to access it otherwise from outside Snowflake? Can you run COPY on it to load it into, say, another table (see the COPY sketch after this answer)?
3) Snowpipe behaves similarly on Azure and AWS except for queue ownership (Azure blob store will not deliver to a queue in another subscription).
4) Multiple pipes share the same queue on AWS, and we use the bucket/prefix to demultiplex to different pipes.
Dinesh Kulkarni
(PM, Snowflake)
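As a sketch of the manual COPY test suggested above (the scratch table and stage names are hypothetical; FORCE is only needed if you test against a table that has already loaded the file, since COPY normally skips files recorded in the load history):
copy into scratch_table
from @my_stage
files = ('file1.txt')
force = true;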
Maybe you can help me with my problem.
I start a Spark job on Google Dataproc through the API. This job writes its results to Google Cloud Storage.
When it finishes, I want to get a callback to my application.
Do you know of any way to get one? I don't want to poll the job status through the API.
Thanks in advance!
I'll agree that it would be nice if there were a way to either wait for, or get a callback for, when operations such as VM creation, cluster creation, or job completion finish. Out of curiosity, are you using one of the API clients (like google-cloud-java), or are you using the REST API directly?
In the meantime, a couple of workarounds come to mind:
1) Google Cloud Storage (GCS) callbacks
GCS can trigger callbacks (either Cloud Functions or Pub/Sub notifications) when you create files. You can create a file at the end of your Spark job, which will then trigger a notification. Or, just add a trigger for when you write an output file to GCS.
If you're modifying the job anyway, you could also just have the Spark job call back directly to your application when it's done.
2) Use the gcloud command line tool (probably not the best choice for web servers)
gcloud already waits for jobs to complete. You can either use gcloud dataproc jobs submit spark ... to submit and wait for a new job to finish, or gcloud dataproc jobs wait <jobid> to wait for an in-progress job to finish.
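For instance (cluster name, job id, main class, and jar path below are all placeholders):
# submit a new job and block until it finishes
gcloud dataproc jobs submit spark --cluster=my-cluster \
    --class=com.example.MySparkJob --jars=gs://my-bucket/my-spark-job.jar
# or wait on a job that was already submitted through the API
gcloud dataproc jobs wait my-job-id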
That being said, if you're purely looking for a callback for choosing whether to run another job, consider using Apache Airflow + Cloud Composer.
In general, the more you tell us about what you're trying to accomplish, the better we can help you. :)