How can I know if Snowflake will create a new SQS Queue?

According to the Snowflake documentation here:
Following AWS guidelines, Snowflake designates no more than one SQS
queue per S3 bucket. This SQS queue may be shared among multiple
buckets in the same AWS account. The SQS queue coordinates
notifications for all pipes connecting the external stages for the S3
bucket to the target tables. When a data file is uploaded into the
bucket, all pipes that match the stage directory path perform a
one-time load of the file into their corresponding target tables.
I am configuring Snowpipe and relying on the ARN of the SQS queue provided by Snowflake (which can be queried via DESCRIBE PIPE <pipe name>). But I am confused by the statement:
This SQS queue may be shared among multiple
buckets in the same AWS account
Does Snowflake use a single SQS Queue for all buckets? How do I know whether to use the same SQS Queue or if Snowflake will create a new one?

You can check the SQS queues used by the pipes:
select pipe_name, notification_channel_name from information_schema.pipes;
I created 3 pipes for testing, using 2 buckets in the same account, and I see that all of them use the same SQS queue.
https://docs.snowflake.com/en/sql-reference/info-schema/pipes.html
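For context, once you have that ARN (the notification_channel / notification_channel_name value), the auto-ingest setup described in the Snowflake docs comes down to pointing the bucket's event notifications at it. A rough boto3 sketch, where the bucket name and ARN are placeholders (note that this call replaces any existing notification configuration on the bucket):

import boto3

# Placeholder values: take the real ARN from DESCRIBE PIPE or information_schema.pipes.
SNOWFLAKE_SQS_ARN = "arn:aws:sqs:us-east-1:123456789012:sf-snowpipe-example"
BUCKET = "my-ingest-bucket"

s3 = boto3.client("s3")
# Point the bucket's object-created events at the Snowflake-managed SQS queue.
# If several pipes/buckets in the account report the same notification channel,
# they all reuse this one ARN.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": SNOWFLAKE_SQS_ARN,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)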

Related

Snowflake Ingest Pipeline - Automatically Remove Ingest Files From Source?

We want to ingest events into Snowflake from an S3 bucket. I know this is possible from this documentation: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html
But after the files have been ingested, we'd like to either delete them or move them out of the ingest bucket.
1: Events loaded into ingest bucket via Firehose (or direct)
2: Snowflake automatically ingests events from this bucket
3: Snowflake either A) moves the files to a new bucket or B) sends a message (SNS?) to a process (lambda) to move the processed file out of the ingest bucket.
Is this possible with Snowflake? We'd like to use the automatic pipeline feature, but it looks like the only way to get this behavior would be to write the pipeline ourselves.
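Not a built-in Snowpipe feature as far as I know, but option 3B could be sketched as a small Lambda along these lines. The bucket names are made up, and the trigger is an assumption: firing it straight off the S3 upload event would race with Snowpipe's load, so in practice you would drive it from something that confirms the load (e.g. COPY_HISTORY) or run it on a delayed schedule.

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
ARCHIVE_BUCKET = "my-ingest-archive"  # hypothetical destination bucket

def handler(event, context):
    # Move objects out of the ingest bucket once they are (assumed to be) loaded.
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        # Copy to the archive bucket, then delete the original.
        s3.copy_object(
            Bucket=ARCHIVE_BUCKET,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        s3.delete_object(Bucket=src_bucket, Key=key)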

How to scale pull queues with Google Cloud Tasks

I have a GAE/P/Standard/FirstGen app that sends a lot of email with Sendgrid. Sendgrid sends my app a lot of notifications for when email is delivered, opened, etc.
This is how I process the Sendgrid notifications:
My handler processes the Sendgrid notification and adds a task to a pull queue
About once every minute I lease a batch of tasks from the pull queue to process them.
This works great except when I am sending more emails than usual. When I am adding tasks to the pull queue at a high rate, the pull queue refuses to lease tasks (it responds with TransientError) so the pull queue keeps filling up.
What is the best way to scale this procedure?
If I create a second pull queue and split the tasks between the two of them, will that double my capacity? Or is there something else I should consider?
====
This is how I add tasks:
from google.appengine.api import taskqueue
q = taskqueue.Queue("pull-queue-name")
# PULL tasks carry only a payload and a tag; they are consumed later via lease_tasks().
q.add(taskqueue.Task(payload=data, method="PULL", tag=tag_name))
I found some information about this in the Google documentation here. According to it, the solution for TransientError is to:
catch these exceptions, back off from calling lease_tasks(), and then
try again later.
etc.
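A minimal sketch of that advice, assuming the first-generation GAE Python taskqueue API (the queue name and batch size below are just examples):

import time
from google.appengine.api import taskqueue

def lease_batch_with_backoff(queue_name="pull-queue-name", max_tasks=100):
    # Lease a batch of pull tasks, backing off when the queue is overloaded.
    q = taskqueue.Queue(queue_name)
    delay = 1
    while True:
        try:
            # Lease up to max_tasks tasks for 60 seconds of processing time.
            return q.lease_tasks(60, max_tasks)
        except taskqueue.TransientError:
            # The queue is under load: back off and retry later, per the docs.
            time.sleep(delay)
            delay = min(delay * 2, 32)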
Actually, I suppose this is the App Engine Task Queue, not Cloud Tasks, which is a different product.
As far as I understand, there is no way to scale this better. It seems the solution might be to migrate to Cloud Tasks and Pub/Sub, which is a better way to manage queues in GAE, as you may find here.
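For illustration only, publishing the Sendgrid notifications to Pub/Sub instead of a pull queue might look roughly like this (the project and topic names are made up):

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "sendgrid-notifications")

def publish_notification(payload, event_type):
    # Publish one Sendgrid notification; payload must be bytes.
    future = publisher.publish(topic_path, payload, event=event_type)
    return future.result()  # blocks until the message is accepted

A subscriber (push or pull) can then process the notifications in batches without the lease limits of the old pull queue.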
I hope it will help somehow... :)

Is Amazon SQS a good tool for handling analytics logging data to a database?

We have a few Node.js servers where the details and payload of each request need to be logged to SQL Server for reporting and other business analytics.
The amount of requests and the similarity of needs between servers has me wanting to approach this with a centralized logging service. My first instinct is to use something like Amazon SQS and let it act as a buffer, either in front of SQL Server directly or in front of a small logging server that makes the database calls driven by SQS.
Does this sound like a good use for SQS or am I missing a widely used tool for this task?
The solution will really depend on how much data you're working with, as each service has limitations. To name a few:
SQS
First off, since you're dealing with logs, you don't want duplication. With this in mind, you'll need a FIFO (first-in, first-out) queue.
SQS by itself doesn't really invoke anything. What you'll want to do here is set up the queue, then make a call to submit a message via the AWS JS SDK. Then, when you get the message back in your callback, get the message ID and pass that data to an invoked Lambda function (you can write those in Node.js as well) which stores the info you need in your database.
That said it's important to know that messages in an SQS queue have a size limit:
The minimum message size is 1 byte (1 character). The maximum is
262,144 bytes (256 KB).
To send messages larger than 256 KB, you can use the Amazon SQS
Extended Client Library for Java. This library allows you to send an
Amazon SQS message that contains a reference to a message payload in
Amazon S3. The maximum payload size is 2 GB.
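The answer above talks about the AWS JS SDK because the servers are Node.js; purely to illustrate the same call shape, here is a boto3 sketch of pushing a log entry onto a FIFO queue (the queue URL and group ID are placeholders):

import json
import uuid
import boto3

sqs = boto3.client("sqs")
# Placeholder FIFO queue URL; FIFO queue names must end in ".fifo".
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/request-logs.fifo"

def enqueue_log(entry):
    # Push one request-log entry (a dict) onto the FIFO queue for later processing.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(entry),
        MessageGroupId="request-logs",             # ordering scope
        MessageDeduplicationId=str(uuid.uuid4()),  # or enable content-based dedup
    )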
CloudWatch Logs
(not to be confused with the high-level CloudWatch service itself, which is more about sending metrics)
The idea here is that you submit event data to CloudWatch Logs.
It also has a limit here:
Event size: 256 KB (maximum). This limit cannot be changed
Unlike SQS, CloudWatch Logs can be automated to pass log data to Lambda, which can then write it to your SQL server. The AWS docs explain how to set that up.
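A rough boto3 sketch of writing a single event (the group and stream names are placeholders and must already exist; older API versions also required tracking a sequence token):

import time
import boto3

logs = boto3.client("logs")
LOG_GROUP = "/app/request-logs"   # placeholder, create with create_log_group
LOG_STREAM = "server-1"           # placeholder, create with create_log_stream

def put_event(message):
    # Write one event to CloudWatch Logs; each event is capped at 256 KB.
    logs.put_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=LOG_STREAM,
        logEvents=[{"timestamp": int(time.time() * 1000), "message": message}],
    )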
S3
Simply set up a bucket and have your servers write data out to it. The nice thing here is that since S3 is meant for storing large files, you really don't have to worry about the previously mentioned size limitations. S3 buckets also have events which can trigger Lambda functions. Then you can happily go on your way sending out log data.
If your log data gets big enough, you can scale out to something like AWS Batch, which gets you a cluster of containers that can be used to process log data. Finally, you also get a data backup: if your DB goes down, you've got the log data stored in S3 and can throw together a script to load everything back up. You can also use Lifecycle Policies to migrate old data to lower-cost storage, or remove it altogether.
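As a sketch, writing batched log records out to S3 as newline-delimited JSON might look like this (the bucket name and key layout are assumptions); an S3-triggered Lambda or a batch job could then load each object into SQL Server:

import datetime
import json
import uuid
import boto3

s3 = boto3.client("s3")
LOG_BUCKET = "my-request-logs"  # hypothetical bucket name

def write_log_batch(entries):
    # Write a list of request-log dicts to S3 as one newline-delimited JSON object.
    key = "logs/{:%Y/%m/%d}/{}.jsonl".format(datetime.datetime.utcnow(), uuid.uuid4())
    body = "\n".join(json.dumps(e) for e in entries)
    s3.put_object(Bucket=LOG_BUCKET, Key=key, Body=body.encode("utf-8"))
    return key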

Websphere MQ - Topic Subscription with multiple consumers

I have a micro-service which subscribes to a topic in WebSphere MQ. The subscription is managed and durable. I explicitly set the subscription name, so that it can be used to connect back to the queue, after recovering from any micro service failure. The subscription works as expected.
But I might have to scale up the micro-service and run multiple instances. In that case I will end up with multiple consumers on the same topic, and it fails with error 2429: MQRC_SUBSCRIPTION_IN_USE. I am not able to run more than one consumer against the topic subscription. Note: a message should be delivered to only one of the consumers.
Any thoughts?
IBM WebSphere MQ version: 7.5
I use the C client API to connect to MQ.
When using a subscriber, what you describe is only supported via the IBM MQ Classes for JMS API. In v7.0 and later you can use Cloned subscriptions (an IBM extension to the JMS spec); in addition, in MQ v8.0 and later you can alternatively use Shared subscriptions, which are part of the JMS 2.0 spec. With these two options, multiple subscribers can be connected to the same subscription and only one of them will receive each published message.
UPDATE 20170710
According to this APAR, IV96489: XMS.NET DOESN'T ALLOW SHARED SUBSCRIPTIONS EVEN WHEN CLONESUP PROPERTY IS ENABLED, XMS.NET is also supposed to support Cloned subscriptions, but due to a defect this will not be supported until 8.0.0.8 or 9.0.0.2, or if you request the IFIX for the APAR above.
You can accomplish something similar with other APIs like C by converting your micro-service to get from a queue instead of subscribing to a topic.
To get the published messages to the queue you have two options:
Set up an administrative subscription on the queue manager. You can do this a few different ways; the example below uses an MQSC command.
DEFINE SUB('XYZ') TOPICSTR('SOME/TOPIC') DEST(SOME.QUEUE)
Create a utility app that opens a queue and creates a durable subscription against that queue. The only purpose of this app would be to subscribe and unsubscribe a provided queue; it would not be used to consume any of the published messages.
Using the above method, each published message can only be read (GET) from the queue by one process or thread.
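To illustrate the queue-based alternative (sketched with pymqi rather than the C client, purely for brevity; the queue manager, channel, and queue names are made up):

import pymqi

queue_manager = "QM1"
channel = "APP.SVRCONN"
conn_info = "mqhost(1414)"

qmgr = pymqi.connect(queue_manager, channel, conn_info)
queue = pymqi.Queue(qmgr, "SOME.QUEUE")
# A destructive GET: when several instances read the same queue, each published
# message is delivered to exactly one of them. This bare get() raises MQMIError
# (reason 2033) if no message is waiting.
message = queue.get()
queue.close()
qmgr.disconnect()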

Multiple outputs streaming analytics

I'm using Stream Analytics with multiple outputs (sub-queries), but we don't see any output, and no error messages appear in the logs. We are using IoT Hub as an input.
In addition to that: to prevent congestion from IoT Hub to ASA, you should add consumer groups (under Messaging in your IoT Hub settings) and map the inputs of your ASA job to these separate consumer groups.
Let us know if you need help with that.
