Is there a way to reduce the Snowpipe load time?

I'm reading the User Guide and it mentions that Snowpipe "typically" takes 1 minute to load the data. In my experiments I found that it always takes about a minute. Where is this 1-minute latency coming from? It feels like there is some per-minute batch processing going on. Is there a setting somewhere to reduce it further?

As of today, there's no setting to reduce this latency: Snowpipe is essentially micro-batching at roughly one-minute granularity.
If you want more frequent updates, your best option is to keep a warehouse running and submit UPDATE or COPY statements to it yourself (as sketched below).
If you don't require sub-minute latency, stick with Snowpipe, potentially paired with a tool like Kinesis Firehose to batch records into a single file that drops into S3 once per minute.
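For illustration, here is a minimal sketch of the keep-a-warehouse-running approach via the Snowflake JDBC driver. The account URL, credentials, table, and stage are placeholders, and the stage is assumed to define its own file format:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class FrequentCopyJob {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "LOAD_USER");                        // placeholder credentials
        props.put("password", System.getenv("SNOWFLAKE_PWD")); // read from environment
        props.put("warehouse", "LOAD_WH");                     // keep running to avoid resume latency
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");

        // The account URL is a placeholder; use your own account identifier.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://myaccount.snowflakecomputing.com", props);
             Statement stmt = conn.createStatement()) {
            // COPY tracks load history, so re-running it does not reload
            // files that were already loaded; it is safe to run often.
            stmt.execute("COPY INTO my_table FROM @my_stage");
        }
    }
}

You would run this from your own scheduler as often as your latency budget requires; the trade-off is the credit cost of keeping the warehouse running.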

Related

Can we modify S3 write/commit time with Flink S3 sink

We are using a Flink bulk writer with OnCheckpointRollingPolicy. The checkpointing interval is set to 35 seconds, which means all S3 writes/commits happen on the 35th second. I have noticed a few scenarios where, due to intermittent backpressure in the job (for 1-5 minutes), checkpointing sometimes gets delayed by a few seconds. We are partitioning data at the minute level in the S3 bucket, and there is one ETL job which processes this data with a one-minute delay. Because of the delayed checkpointing, the ETL job didn't find the data for that minute; it only got added later.
To solve this issue, we want to ensure that if data is delayed, the delayed data gets added to a later minute folder rather than the previous minute folder, so that the ETL job will pick up the past missing data from later minute folders.
I didn't find any way to modify this S3 write time, and it looks like it is coupled to the checkpointing time only.
Please let me know if there is any way to fix this issue.
FileSink.forBulkFormat(
        s3SinkPath,
        new ParquetWriterFactory<>(builder))
    // Custom assigner that routes each record to a minute-level S3 folder.
    .withBucketAssigner(new CustomMinuteLevelPartitionBucketAssigner())
    // Check buckets for roll-over every 35 seconds (value is in milliseconds).
    .withBucketCheckInterval(35000)
    .build();
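There is no knob on FileSink to move the commit time away from checkpoints: with OnCheckpointRollingPolicy, in-progress files only become visible when a checkpoint completes. What can be changed is which folder a delayed record lands in. If the custom assigner keys folders off event time, switching to processing time at write means records held up by backpressure fall into the then-current (later) minute folder, which matches the behaviour asked for above. A rough sketch, assuming minute-level folders (CustomMinuteLevelPartitionBucketAssigner isn't shown in the question, so this is illustrative):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

// Buckets records by the wall-clock minute at which they are written,
// so records delayed by backpressure fall into a later minute folder.
public class ProcessingTimeMinuteBucketAssigner<IN> implements BucketAssigner<IN, String> {

    private static final DateTimeFormatter MINUTE_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm").withZone(ZoneOffset.UTC);

    @Override
    public String getBucketId(IN element, Context context) {
        // currentProcessingTime() is the wall clock of the writing subtask.
        return MINUTE_FORMAT.format(Instant.ofEpochMilli(context.currentProcessingTime()));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

The file containing those records still only appears at the next successful checkpoint, so the ETL job should treat a minute folder as complete only after a grace period of at least one checkpoint interval.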

Write flink stream to relational database

I am working on a Flink project which writes a stream to a relational database.
In the current solution, we wrote a custom sink function which opens a transaction, executes an SQL insert statement, and closes the transaction. It worked well until the data volume increased and we started getting connection timeout issues. We tried a few connection pool configuration adjustments, but they did not help much.
We are thinking of trying "batch insert" to decrease the number of writes to the database. We came across a few classes which do almost what we want: JDBCOutputFormat and JDBCSinkFunction. With JDBCOutputFormat, we can configure the batch size.
We would also like to force a batch insert every minute if the number of records does not reach the batch size. How would you normally deal with these kinds of problems? My first thought was to extend JDBCOutputFormat with a scheduled task that forces a flush every minute, but it was not obvious how that could be done.
Do we have to write our own sink altogether?
Updated:
JDBCSinkFunction does a flush and batch execute each time Flink checkpoints. So long as you are doing checkpointing, the batches won't be any longer than the checkpointing interval.
However, having read this mailing list thread, I see that JDBCSinkFunction does not support exactly-once output.
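If you do want a hard one-minute cap independent of checkpointing, one approach is a self-contained sink that flushes on whichever comes first, batch size or a timer. A minimal sketch, at-least-once only (MyRecord, the connection URL, and the INSERT statement are placeholders; this is not Flink's built-in JDBCOutputFormat):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Batches inserts and flushes on whichever comes first:
// the batch size limit or a one-minute timer.
public class TimedBatchJdbcSink extends RichSinkFunction<TimedBatchJdbcSink.MyRecord> {

    // Placeholder element type for illustration.
    public static class MyRecord {
        long id;
        String payload;
    }

    private static final int BATCH_SIZE = 500;

    private transient Connection connection;
    private transient PreparedStatement statement;
    private transient ScheduledExecutorService scheduler;
    private int batchCount = 0;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:postgresql://db:5432/app", "user", "pwd");
        statement = connection.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)");
        scheduler = Executors.newSingleThreadScheduledExecutor();
        // Time-based flush: at most one minute between batch executions.
        scheduler.scheduleAtFixedRate(this::flushQuietly, 1, 1, TimeUnit.MINUTES);
    }

    @Override
    public synchronized void invoke(MyRecord record, Context context) throws Exception {
        statement.setLong(1, record.id);
        statement.setString(2, record.payload);
        statement.addBatch();
        if (++batchCount >= BATCH_SIZE) {
            flush();  // size-based flush
        }
    }

    private synchronized void flush() throws Exception {
        if (batchCount > 0) {
            statement.executeBatch();
            batchCount = 0;
        }
    }

    private void flushQuietly() {
        try {
            flush();
        } catch (Exception e) {
            // Rethrowing stops the timer; a production sink should
            // surface the error and fail the Flink task instead.
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() throws Exception {
        scheduler.shutdown();
        flush();
        statement.close();
        connection.close();
    }
}

Note this is at-least-once: records buffered but not yet flushed can be replayed after a failure, so the target table should tolerate duplicates (e.g. via an upsert), consistent with the note above that JDBCSinkFunction doesn't give exactly-once either.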

Tech-stack for querying and alerting on GB scale (streaming and at rest) datasets

Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each producing ~200 records per second (~2 KB/second) and sending them off to a remote server once per minute, resulting in roughly 18 million records and 200 MB of data per day per sensor. Not sure how many sensors we will need, but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure of the time period; guessing less than 1 day), as well as run queries on past data. We'd like something that scales and is relatively stable.
I was thinking about using Elasticsearch (then maybe X-Pack or SentiNL for alerting). I thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS, so we have access to tools like Kinesis as well.
The question is: what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing and will be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services.
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid, or Elasticsearch, depending on your team's preferences. Some would say that this is time series data, so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) under a steady load of 1,000 writes per second. I would not recommend an RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.
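To make the ingestion edge concrete, here is a minimal sketch of a producer pushing one sensor record into Kinesis with the AWS SDK for Java v1 (the stream name and payload shape are placeholders):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class SensorPublisher {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // Hypothetical sensor reading serialized as JSON.
        String payload = "{\"sensorId\":\"sensor-1\",\"ts\":1700000000,\"value\":42.0}";

        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("sensor-events")   // placeholder stream name
                .withPartitionKey("sensor-1")      // keeps one sensor's records ordered
                .withData(ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8)));

        kinesis.putRecord(request);
    }
}

In production you would batch with PutRecords or use the Kinesis Producer Library, and a Firehose delivery stream can handle the Kinesis-to-S3 archival path mentioned above without custom code.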

Alternative for Polling Record State

We currently have a payment tracking system which uses MS SQL Server Enterprise. When a client requests a service, they have to make the payment within 24 hours; otherwise we send them an SMS reminder. Our current implementation simply records the date and time of the purchase and constantly polls the records to find "expired" purchases.
This is generating so much load on the database that we have to implement some form of replication in order to offload these operations to another server.
I was thinking: is there a way to combine CLR triggers with some kind of a scheduler that would be triggered only once, that is, 24 hours after the purchase is created?
Please keep in mind that we have tens of thousands of transactions per hour.
I am not sure how you are thinking that SQLCLR will solve this problem. I don't think this needs to be handled in the DB at all.
Since the request time doesn't change, why not load all requests into a memory-based store that you can poll constantly without touching the database? You would store the 24-hours-from-request time so that you only need to compare those times to now. If the customer pays prior to the 24-hour deadline, you remove the entry from the cache. Otherwise, the polling process will eventually find it, process it, and remove it from the cache.
OR, similarly, you can use a scheduler and load a future event to be the SMS message, based on the 24-hour-from-request time, upon each request. Similar to scheduling an action using "AT". Again, if someone pays prior to that time, just remove the scheduled task/event/reminder.
You would store just the 24-hour-after-time and the RequestID. If the time is reached, the service would refer back to the DB using that RequestID to get the current info.
You just need to make sure to de-list items from the cache / scheduler if payment is made prior to the 24-hour-after time.
And if the system crashes / restarts, you just load all entries that are a) unpaid, and b) have not yet reached their 24-hour-after time.
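As a concrete illustration of the scheduler variant, here is a minimal sketch in Java using a ScheduledExecutorService keyed by request ID (persistence and the crash-recovery reload described above are omitted; the SMS call is a placeholder):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class PaymentReminderScheduler {

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
    private final ConcurrentHashMap<Long, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();

    // Called when a purchase is created: schedule an SMS 24 hours out.
    public void onPurchaseCreated(long requestId) {
        ScheduledFuture<?> future = scheduler.schedule(
                () -> sendSmsReminder(requestId), 24, TimeUnit.HOURS);
        pending.put(requestId, future);
    }

    // Called when payment arrives: de-list the reminder before it fires.
    public void onPaymentReceived(long requestId) {
        ScheduledFuture<?> future = pending.remove(requestId);
        if (future != null) {
            future.cancel(false);
        }
    }

    private void sendSmsReminder(long requestId) {
        pending.remove(requestId);
        // Look up the current request info in the DB by requestId, then send the SMS.
        // (Placeholder: integration with the SMS gateway is out of scope here.)
    }
}

On startup, reload all unpaid requests whose 24-hour window hasn't elapsed and re-schedule them with the remaining delay, as described above.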

How to limit statistics in Gatling report to steady state

I have a classic Gatling benchmark setup: I ramp users for a while and then keep a constant rate. The very nice reports Gatling generates include stats, but IIUC these cover the whole scenario, not just the steady state. Is it possible to limit the stats to only a certain part of the scenario (basically, ignore all requests started or finished outside this period)? Or do I have to manually truncate the simulation logs?
That's only available in FrontLine, our commercial product. You can get stats on any time window.
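If FrontLine isn't an option, manually truncating the log is workable with the older text (TSV) simulation.log format. A rough sketch follows; the REQUEST-line column holding the request start timestamp varies across Gatling versions, so the index below is an assumption to verify against your own log:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class TruncateSimulationLog {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get(args[0]);               // path to simulation.log
        long steadyStart = Long.parseLong(args[1]);  // steady-state start, epoch millis
        long steadyEnd = Long.parseLong(args[2]);    // steady-state end, epoch millis

        List<String> kept = Files.readAllLines(log).stream()
                .filter(line -> {
                    String[] cols = line.split("\t");
                    if (!"REQUEST".equals(cols[0])) {
                        return true;  // keep RUN/USER and other record types untouched
                    }
                    // ASSUMPTION: request start timestamp in this column; the exact
                    // index differs between Gatling versions, so adjust as needed.
                    long start = Long.parseLong(cols[3]);
                    return start >= steadyStart && start <= steadyEnd;
                })
                .collect(Collectors.toList());

        // Write the truncated log as simulation.log into a fresh results folder,
        // then rebuild the report from it: gatling.sh -ro <that-folder>
        Path outDir = Files.createDirectories(log.resolveSibling("steady"));
        Files.write(outDir.resolve("simulation.log"), kept);
    }
}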
