We are using Flink to process input events, aggregate them, and write the output of our streaming job to S3 using StreamingFileSink, but whenever we try to restore the job from a savepoint, the restoration fails with a missing part files error. As per my understanding, S3 deletes those intermediate part files, so they can no longer be found on S3. Is there a workaround for this, so that we can use S3 as a sink?
I have deployed my own Flink setup on AWS ECS: one service for the JobManager and one service for the TaskManagers. I am running one ECS task for the JobManager and 3 ECS tasks for the TaskManagers.
I have a batch-style job which I upload via the Flink REST API every day with new arguments. Each time I submit it, disk usage grows by ~600 MB. Checkpoints are written to S3, and I have also set historyserver.archive.clean-expired-jobs to true.
Since I am running on ECS, I am not able to figure out why usage increases on every jar upload and execution.
Which Flink config parameters should I look at to make sure usage does not keep growing on every new job upload?
Try these configuration options.
blob.service.cleanup.interval:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#blob-service-cleanup-interval
historyserver.archive.retained-jobs:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#historyserver-archive-retained-jobs
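For example, a minimal flink-conf.yaml sketch using those two options (the values here are illustrative, not recommendations):
# how often (in seconds) unreferenced BLOBs such as uploaded job jars are cleaned up
blob.service.cleanup.interval: 3600
# cap the number of archived jobs the HistoryServer retains (default -1 = unlimited)
historyserver.archive.retained-jobs: 50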
We want to ingest events into Snowflake from an S3 bucket. I know this is possible from this documentation: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html
But after the files have been ingested, we'd like to either delete them or move them out of the ingest bucket. The desired flow:
1: Events loaded into ingest bucket via Firehose (or direct)
2: Snowflake automatically ingests events from this bucket
3: Snowflake either A) moves the files to a new bucket or B) sends a message (SNS?) to a process (lambda) to move the processed file out of the ingest bucket.
Is this possible with Snowflake? I'd like to use the automatic ingestion feature, but it looks like the only way to get this behavior would be to write the pipeline ourselves.
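For reference, the auto-ingest pipe that documentation describes looks roughly like the sketch below (the stage, pipe, table names, and bucket path are placeholders); note that the pipe itself only loads the files, so the move/delete step would still have to happen outside Snowflake:
-- hypothetical names throughout
create stage ingest_stage
  url = 's3://my-ingest-bucket/events/'
  credentials = (aws_key_id = '...' aws_secret_key = '...');

create pipe events_pipe auto_ingest = true as
  copy into events
  from @ingest_stage
  file_format = (type = 'json');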
I have created a Snowpipe for one of my Snowflake tables. Source files land in an AWS S3 bucket periodically, so I followed the steps below to create the Snowpipe:
Created an external stage
Queried the files using the "PUT" command (able to see the list of available files in the results panel)
Created the Snowpipe
Configured the SQS notification on the S3 bucket
Added one sample file, but it was not loaded automatically
Altered the Snowpipe using the following command:
alter pipe snowpipe_content refresh;
The file was then loaded into the Snowflake target table after some time.
Can someone please help me figure out what I missed in the Snowpipe setup?
Follow the steps below to troubleshoot your Snowpipe:
Step I: Check the status of your Snowpipe:
SELECT SYSTEM$PIPE_STATUS('pipe_name');
Make sure your pipe status is RUNNING
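If the status comes back as something other than RUNNING because the pipe has been paused, it can be resumed, for example (pipe name as in the question):
alter pipe snowpipe_content set pipe_execution_paused = false;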
Step II: Check the copy history for the table associated with the Snowpipe:
select *
from table(information_schema.copy_history(table_name=>'table_name', start_time=> dateadd(hours, -1, current_timestamp())));
Check whether your file appears in this list and whether it loaded or errored.
Step III: Validate your Snowpipe load:
select *
from table(validate_pipe_load(
  pipe_name=>'pipe_name',
  start_time=>dateadd(hour, -1, current_timestamp())));
If the above steps look good, the issue might be with your SQS notification setup.
Refer to the Snowflake knowledge base article at the link below:
Snowflake KB
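One quick check worth adding: confirm that the S3 bucket's event notification targets the SQS queue ARN that Snowflake created for the pipe, which you can look up with (pipe name as in the question):
show pipes like 'snowpipe_content';
-- or
desc pipe snowpipe_content;
-- compare the notification_channel value against the SQS ARN configured on the bucket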
Just playing around with Snowpipe. I had it working: I would drop a file onto S3 and Snowpipe would load the data into a Snowflake table.
However, when I copied the same file into the S3 bucket a second time, Snowpipe didn't pick it up, nor did it pick up any subsequent files that were not duplicates.
To illustrate:
Uploaded file1.txt into the S3 bucket - success
Uploaded file2.txt into the S3 bucket - success
Uploaded file3.txt into the S3 bucket - success
Re-Uploaded file1.txt into the S3 bucket - no result - table was not updated
Uploaded file4.txt into the S3 bucket - no result - table was not updated
How do I go about troubleshooting or fixing this issue?
Thanks
A few clarifications:
1) Yes, Snowpipe will not load a file again. If there is an error in the file and you need to modify it, you will need to rename it (e.g. file1v2.txt).
2) The behavior you noticed regarding the next file not being loaded is unexpected and requires troubleshooting. Is there any issue with the next file (since it is showing up as a pending file count of 1)? Are you able to access it otherwise from outside Snowflake? Can you run COPY on it to load it into, say, another table (see the COPY sketch after this answer)?
3) Snowpipe behaves similarly on Azure and AWS except for queue ownership (Azure blob store will not deliver to a queue in another subscription).
4) Multiple pipes share the same queue on AWS, and we use the bucket/prefix to demultiplex to different pipes.
Dinesh Kulkarni
(PM, Snowflake)
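As a sketch of the manual COPY test suggested above (the scratch table and stage names are hypothetical; FORCE is only needed if you test against a table that has already loaded the file, since COPY normally skips files recorded in the load history):
copy into scratch_table
from @my_stage
files = ('file1.txt')
force = true;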
Maybe you can help me with my problem.
I start a Spark job on Google Dataproc through the API. This job writes its results to Google Cloud Storage.
When it finishes, I want to get a callback to my application.
Do you know of any way to get one? I don't want to poll the job status through the API.
Thanks in advance!
I'll agree that it would be nice if there were a way to either wait for, or get a callback for, when operations such as VM creation, cluster creation, or job completion finish. Out of curiosity, are you using one of the API clients (like google-cloud-java), or are you using the REST API directly?
In the meantime, a couple of workarounds come to mind:
1) Google Cloud Storage (GCS) callbacks
GCS can trigger callbacks (either Cloud Functions or Pub/Sub notifications) when you create files. You can create a file at the end of your Spark job, which will then trigger a notification. Or, just add a trigger for when you write an output file to GCS.
If you're modifying the job anyway, you could also just have the Spark job call back directly to your application when it's done.
2) Use the gcloud command line tool (probably not the best choice for web servers)
gcloud already waits for jobs to complete. You can either use gcloud dataproc jobs submit spark ... to submit and wait for a new job to finish, or gcloud dataproc jobs wait <jobid> to wait for an in-progress job to finish.
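For instance (cluster name, job id, main class, and jar path below are all placeholders):
# submit a new job and block until it finishes
gcloud dataproc jobs submit spark --cluster=my-cluster \
    --class=com.example.MySparkJob --jars=gs://my-bucket/my-spark-job.jar
# or wait on a job that was already submitted through the API
gcloud dataproc jobs wait my-job-id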
That being said, if you're purely looking for a callback for choosing whether to run another job, consider using Apache Airflow + Cloud Composer.
In general, the more you tell us about what you're trying to accomplish, the better we can help you. :)