I have some features that work with a large amount of data. I'm trying to use the Laravel model cursor to get data from the database.
My concern is whether the Laravel model cursor will lock the table, because the feature works with a large amount of data, so it will take time to execute. If the table is locked, other features will need to wait until this feature finishes.
Does Laravel cursor lock the table?
I am working with data from the Snowflake Marketplace; these are large, multi-billion-record tables.
I have two conflicting needs: speed & up-to-date data
I am able to have up-to-date data by working exclusively with views - meaning the data is up to date from the vendor's perspective at the moment I make a query. However, performance is terrible (the vendor does not cluster their tables the way I would do it).
I can also materialize copies of the tables with my chosen cluster keys. This works great for performance, but it introduces a 10-20h lag every time the tables are updated - which is not good.
My main issue is that this data is "changed" all the time, i.e. current and historical values are updated by the vendor in place (not appended). This makes incremental runs almost impossible.
Does Snowflake have any feature that could help in this context?
You mentioned that it's a table, not a view that they're sharing. You can request permission from the vendor of this data to be able to place a stream on their table. Once you have a stream on their table, you'll get all the rows you need to complete a synchronization of their table changes with your local copy. This should reduce the 10-20h lag because without setting a stream on their side you'll wind up doing full refreshes. This approach will allow you to handle incremental changes.
When you try to create a stream on a shared table, unless you've already arranged it with the vendor or the vendor has already enabled this for another share consumer, you may get this message:
SQL access control error: Insufficient privileges to operate on stream
source without CHANGE_TRACKING enabled 'MY_TABLE'
This just means the sharing vendor must enable change tracking on their side. On the sharing account side:
alter table MY_TABLE set change_tracking = true;
As soon as they make that change, any and all sharing consumers will be able to create a stream on the table:
create stream MY_STREAM on table MY_TABLE;
status
Stream MY_STREAM successfully created.
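From there, a rough sketch of consuming the stream from a JVM service over the Snowflake JDBC driver could look like the following. LOCAL_COPY, ID, and VAL are hypothetical names for your local copy and its columns, and the MERGE follows the usual stream-consumption pattern, so adapt it to the vendor's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class SyncFromStream {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "<user>");
        props.put("password", "<password>");
        props.put("db", "MY_DB");          // hypothetical database/schema/warehouse
        props.put("schema", "MY_SCHEMA");
        props.put("warehouse", "MY_WH");

        try (Connection con = DriverManager.getConnection(
                "jdbc:snowflake://<account_identifier>.snowflakecomputing.com/", props);
             Statement stmt = con.createStatement()) {

            // Consuming the stream inside a transaction advances its offset,
            // so the next run only sees changes made after this point.
            stmt.execute("begin");
            stmt.execute(
                "merge into LOCAL_COPY t " +
                "using (select * from MY_STREAM) s on t.ID = s.ID " +
                "when matched and s.METADATA$ACTION = 'DELETE' and not s.METADATA$ISUPDATE then delete " +
                "when matched and s.METADATA$ACTION = 'INSERT' then update set t.VAL = s.VAL " +
                "when not matched and s.METADATA$ACTION = 'INSERT' then insert (ID, VAL) values (s.ID, s.VAL)");
            stmt.execute("commit");
        }
    }
}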
We have a Kafka consumer service to ingest data into our DB. Whenever we receive a message from the topic, we compose an INSERT statement to insert the message into the DB. We use a DB connection pool to handle the insertion, and so far so good.
Currently, we need to add a filter to select only the related messages from Kafka and insert those. There are two options in my mind to do this.
Option 1: Create a config table in the DB to define our filtering condition.
Pros
No need to make code changes or redeploy services
Just insert new filters into the config table; the service will pick them up on the next run
Cons
Need to query the DB every time we receive new messages.
Say we receive 100k new messages daily and need to filter out 50k. In total we only need to run 50k INSERT commands, but we would need to run 100k SELECT queries to check the filter condition for every single Kafka message.
Option 2: Use a hardcoded config file to define those filters.
Pros
Only need to read the filters once when the consumer starts running
Has no burden on the DB layer
Cons
This is not a scalable approach, since we are planning to add a lot of filters; every time we add one, we need to change the config file and redeploy the consumer.
My question is: is there a better option to achieve the goal, i.e. to find the filters without using a hardcoded config file and without increasing the volume of DB queries?
Your filters could be in another Kafka topic.
Start your app and read that topic to the end, and only then start doing database inserts. Store each consumed filter record in some local structure such as a ConcurrentHashMap, SQLite, RocksDB (provided by Kafka Streams), or DuckDB, which has been popular recently.
When you add a new filter, your consumer would need to temporarily pause its database operations.
If you use Kafka Streams, you could look up data from the incoming topic against your filters "table" state store using the Processor API and drop the non-matching records from the stream (see the sketch below).
This way, you separate your database reads and writes once you start inserting 50k+ records, and your app wouldn't be blocked trying to read any "external config".
You could also use ZooKeeper, as that's one of its use cases.
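To make the Kafka Streams idea concrete, here is a minimal sketch. It uses the DSL's GlobalKTable join instead of the raw Processor API mentioned above, and it assumes hypothetical topic names "events" and "filters" and that the event key is what the filters match on; events without a matching filter entry are dropped before any database work.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class FilteredIngest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filtered-ingest");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // The filters topic is materialized as a global table (RocksDB state store),
        // so lookups are local and never hit the database.
        GlobalKTable<String, String> filters =
                builder.globalTable("filters", Consumed.with(Serdes.String(), Serdes.String()));

        KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        // Inner join against the filter table: events with no matching filter are dropped.
        events.join(filters,
                        (eventKey, eventValue) -> eventKey,      // map each event to the filter-table key
                        (eventValue, filterValue) -> eventValue) // keep the event payload
              .foreach((key, value) -> insertIntoDb(key, value)); // existing INSERT path

        new KafkaStreams(builder.build(), props).start();
    }

    private static void insertIntoDb(String key, String value) {
        // placeholder for the existing connection-pool INSERT logic
    }
}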
I would like to create a per-user view of data tables stored in Flink, which is constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this is possible.
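Concretely, the per-session query I have in mind would be roughly the following sketch (the table foo, the userid column, and the literal user id are placeholders; the real table would be registered by the surrounding pipeline):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class PerUserView {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Assumes a table named "foo" is already registered in the catalog.
        // One ad-hoc continuous query per user session.
        Table userView = tEnv.sqlQuery("SELECT * FROM foo WHERE userid = 'X'");

        // Initial snapshot arrives as +I rows, later changes as -U/+U/-D rows.
        DataStream<Row> changelog = tEnv.toChangelogStream(userView);
        changelog.print();

        env.execute("per-user-view"); // each submission compiles its own job graph
    }
}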
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state, since I could group the current rows under the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to periodically re-query the state, which is not ideal for my use case (the per-user state can sometimes be large but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?
In our application we currently use DynamoDB to store notification details. A scheduler runs twice a day and queries by "notificationType" (pk -> notificationType, sk -> userId).
Each item has a timestamp attribute; if the timestamp is later than the current time, we send a trigger (there is more business logic, e.g. for some records a mail needs to be sent one day after the timestamp). Once the user performs the activity for which the notification was sent, we delete the entry.
My concern is that if the data grows large for a notificationType, retrieving all of it is redundant, because for some records the notification is not going to be sent. So more read capacity is used, which might increase the cost at a later point in time.
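For reference, the scheduler's read path is roughly the following sketch (AWS SDK v2, with table and attribute names simplified); every item under the partition key gets read, even the ones that won't trigger a mail in this run:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

import java.util.Map;

public class NotificationScheduler {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        QueryRequest request = QueryRequest.builder()
                .tableName("notifications")
                .keyConditionExpression("notificationType = :nt")
                .expressionAttributeValues(Map.of(
                        ":nt", AttributeValue.builder().s("REMINDER").build()))
                .build();

        // Reads every item under the partition key, including those whose
        // timestamp means no mail will be sent this run.
        QueryResponse response = ddb.query(request);
        long now = System.currentTimeMillis();
        for (Map<String, AttributeValue> item : response.items()) {
            long timestamp = Long.parseLong(item.get("timestamp").n());
            if (timestamp > now) {
                // business logic, e.g. send a mail one day after the timestamp
            }
        }
    }
}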
In this case, would it be wise to keep using DynamoDB, or to move to another DB like MongoDB, Cassandra, or something else?
Note: My primary concern is the cost
Another option is to use a workflow engine that can model the notification process per user instead of a batch job. This way you can avoid scanning large amounts of data as the engine would rely on durable timers to execute actions at the appropriate time.
My open-source project temporal.io, which I led at Uber, is used by multiple companies for notification-like scenarios and was tested up to 200 million open parallel workflows.
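For illustration only, a rough sketch with the Temporal Java SDK could look like the following; the workflow, activity, and parameter names are made up, and the signal plays the role that deleting the DynamoDB item plays today.

import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.SignalMethod;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

@WorkflowInterface
interface NotificationWorkflow {
    @WorkflowMethod
    void notifyUser(String userId, long sendAtEpochMillis);

    @SignalMethod
    void userCompletedActivity(); // replaces deleting the DynamoDB item
}

@ActivityInterface
interface NotificationActivities {
    void sendMail(String userId);
}

class NotificationWorkflowImpl implements NotificationWorkflow {

    private boolean completed = false;

    private final NotificationActivities activities = Workflow.newActivityStub(
            NotificationActivities.class,
            ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofMinutes(1)).build());

    @Override
    public void notifyUser(String userId, long sendAtEpochMillis) {
        Duration delay = Duration.ofMillis(sendAtEpochMillis - Workflow.currentTimeMillis());
        // Durable timer: the workflow sleeps cheaply until the send time, or
        // wakes early if the user completed the activity in the meantime.
        Workflow.await(delay, () -> completed);
        if (!completed) {
            activities.sendMail(userId);
        }
    }

    @Override
    public void userCompletedActivity() {
        completed = true;
    }
}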
I've recently been trying to come up with a retry mechanism for Google's BigQuery streaming API, for running DML UPDATE statements over rows that could sometimes still be in the streaming buffer. As these rows have not yet been committed to the table, BigQuery's API forbids running UPDATE or DELETE statements on them. As I understand it, there is no way to manually flush the streaming buffer yourself.
My question is: is there a way, or a good practice, to build a retry mechanism for this call that handles the announced possible 90 minutes of wait time (that the rows can stay in the buffer)?
I would recommend copying data from the table where the streaming occurs to another table, on which DML can be run without any limitations. This permanent table could be created with the jobs.insert API.
You need to treat this permanent table as the source of truth; on the original table you could enable a table expiration time or a partition expiration, depending on your needs and copying frequency.
Now that you have the permanent table, you can run other processing on that data, generate reports, etc.
A drawback of the above is that you could still have some late data, so you should fetch/copy and deduplicate a reasonable window of data into the permanent table to guarantee you have the most recent data.
I assume that streaming into your table can happen again and again, so in theory you might never be able to run your DML because the streaming buffer might never be empty.
Anyway, if you still need a retry mechanism, try using something like https://github.com/awaitility/awaitility
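For example, a minimal sketch of such a retry using Awaitility together with the BigQuery Java client could look like the following; the dataset, table, and column names are placeholders, and the timeout is sized around the announced ~90-minute streaming-buffer window.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import org.awaitility.Awaitility;

import java.time.Duration;

public class StreamingBufferRetry {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration update = QueryJobConfiguration.newBuilder(
                "UPDATE `my_dataset.my_table` SET status = 'done' WHERE id = 42").build();

        // Poll with a coarse interval and a generous timeout rather than
        // hammering the API; retries continue while the UPDATE is rejected
        // because the affected rows are still in the streaming buffer.
        Awaitility.await()
                .atMost(Duration.ofMinutes(100))
                .pollInterval(Duration.ofMinutes(5))
                .ignoreExceptions() // swallow the streaming-buffer error and retry
                .until(() -> {
                    bigquery.query(update);
                    return true; // reached only once the UPDATE succeeds
                });
    }
}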