Google BigQuery DML Update query on rows still in the Streaming buffer - google-app-engine

I have recently been trying to come up with a retry mechanism for Google BigQuery's streaming API, for running DML queries with an UPDATE statement over rows that may still be in the streaming buffer. Because these rows have not yet been exported to the table, BigQuery's API forbids running UPDATE or DELETE statements on them. As I understand it, there is no way to flush the streaming buffer manually.
My question is: is there a way, or a good practice, to wrap the call in some sort of retry mechanism that keeps trying for the announced wait time of up to 90 minutes that rows can spend in the buffer?

I would recommend copying the data from the table that receives the streaming inserts into another table, on which DML can be run without these limitations. This second, permanent table can be created with the jobs.insert API.
You need to treat this permanent table as the source of truth. On the original table you could enable a table expiration time or a partition expiration, depending on your needs and copying frequency.
Once you have the permanent table, you can run other processing on that data, generate reports, and so on.
The drawback is that some data may arrive late, so on each run you should fetch, copy and deduplicate a reasonable window of data into the permanent table to guarantee that it holds the most recent data.
I assume that streaming into your table can happen again and again, so in theory you might never be able to run your DML, because the streaming buffer might never be empty.
If you still need a retry mechanism, try something like https://github.com/awaitility/awaitility
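For readers who are not on the JVM, the same idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `StillInStreamingBuffer` is a hypothetical exception standing in for whatever error your client raises when the UPDATE touches buffered rows, and the delays are arbitrary defaults, not values recommended by BigQuery.

```python
import time


class StillInStreamingBuffer(Exception):
    """Hypothetical error: the UPDATE touched rows still in the buffer."""


def retry_until_deadline(action, deadline_s=90 * 60, first_delay_s=30.0,
                         max_delay_s=600.0, clock=time.monotonic,
                         sleep=time.sleep):
    """Call action() until it succeeds or deadline_s seconds have elapsed."""
    start = clock()
    delay = first_delay_s
    while True:
        try:
            return action()
        except StillInStreamingBuffer:
            remaining = deadline_s - (clock() - start)
            if remaining <= 0:
                raise  # give up: the deadline has passed
            # Back off exponentially, but never sleep past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting. Any other exception propagates immediately, which is usually what you want for errors that retrying cannot fix.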

Related

How do you optimize query performance on Marketplace shared views in Snowflake? Note that incremental materialization probably would not work

I am working with data from Snowflake Marketplace; these are large, multi-billion-record tables.
I have two conflicting needs: speed & up-to-date data
I am able to have up-to-date data by working exclusively with views, meaning the data is up to date from the vendor's perspective at the moment I make a query. However, performance is terrible (the vendor does not cluster their tables the way I would do it).
I can also materialize copies of the tables with my chosen cluster keys. This works great for performance, but it introduces a 10-20h lag every time the tables are updated, which is not good.
My main issue is that this data is "changed" all the time, i.e. current and historical values are updated by the vendor in place (not appended). This makes incremental runs almost impossible.
Does Snowflake have any feature that could help in this context?
You mentioned that it's a table, not a view that they're sharing. You can request permission from the vendor of this data to be able to place a stream on their table. Once you have a stream on their table, you'll get all the rows you need to complete a synchronization of their table changes with your local copy. This should reduce the 10-20h lag because without setting a stream on their side you'll wind up doing full refreshes. This approach will allow you to handle incremental changes.
When you try to create a stream on a shared table, unless you've already arranged it with the vendor or the vendor has already enabled this for another share consumer, you may get this message:
SQL access control error: Insufficient privileges to operate on stream
source without CHANGE_TRACKING enabled 'MY_TABLE'
This just means the sharing vendor must enable change tracking on their side. On the sharing account side:
alter table MY_TABLE set change_tracking = true;
As soon as they make that change, any and all sharing consumers will be able to create a stream on the table:
status
Stream MY_STREAM successfully created.
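To show how the stream's rows translate into an incremental sync, here is a small Python sketch that applies one batch of change rows to a local copy held in a dict. It relies on the stream metadata columns METADATA$ACTION and METADATA$ISUPDATE, and on the fact that an update shows up as a DELETE row plus an INSERT row. The key column `id` and the row contents are assumptions made for the example; inside Snowflake the same logic would more likely be written as a MERGE from the stream into the local table.

```python
def apply_stream_changes(local_copy, change_rows, key="id"):
    """Apply one batch of stream rows to a dict keyed by primary key."""
    # Deletes first: an update arrives as a DELETE row plus an INSERT row,
    # so the INSERT must win when both touch the same key.
    for row in change_rows:
        if row["METADATA$ACTION"] == "DELETE":
            local_copy.pop(row[key], None)
    for row in change_rows:
        if row["METADATA$ACTION"] == "INSERT":
            # Keep only the data columns, not the stream metadata.
            local_copy[row[key]] = {
                k: v for k, v in row.items() if not k.startswith("METADATA$")
            }
    return local_copy
```

Only the changed rows are touched, which is what removes the need for a full refresh.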

idiomatic way to do many dynamic filtered views of a Flink table?

I would like to create a per-user view of data tables stored in Flink, which is constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this is possible.
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state since I could group the current rows behind the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to constantly re-query the state on a periodic basis, which is not ideal for my use case (the per-user state can be large sometimes but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?

can (should) I use Flink like an in-memory database?

I've used batch Beam but am new to the streaming interface. I'm wondering about the appropriateness of using Apache Flink / Beam kind of like an in-memory database -- I'd like to constantly recompute and materialize one specific view of my data based on edge triggered updates.
More details: I have a few tables in a (normal) database, ranging from thousands to millions of rows, and each one has a many-to-many (M2M) relationship with other ones. Picture to explain:
Hosts <-M2M #1-> Table 1 <-M2M #2-> Table 2 <-M2M #3-> Table 3
Table 1 is a set of objects that the hosts need to know about, and each host needs to know about all downstream rows referenced directly or indirectly by the objects in Table 1 that it's related to. When changes happen anywhere other than the first many-to-many relationship M2M #1, it's not obvious which hosts need to be updated without traversing "left" to find the hosts and then traversing "right" to get all the necessary configuration. The objects and relationships at most levels change frequently, and I need sub-second latency to go from "a record or relationship changed" to recalculating any flattened config files with changes in them so that I can push updates to the hosts very quickly.
Is this an appropriate use case for streaming Flink / Beam? I have worked with Beam in a different system but only in batch mode, and I think that it would be a great tool to use here if I could edge-trigger it. The part I'm getting stuck on is, in batch mode, the PCollections are all "complete" in the sense that I can always join all records in one table with all records in another table. But with streaming, once I process a record once, it gets removed from its PCollection and can't be joined against future updates that arrive later on and relate to it, right? IIUC, it's only available within a window, but I effectively want an infinitely long window where only outdated versions of items in a PCollection (e.g. versions of them which have been overwritten by a new version that came in over the stream) would be freed up.
(Also, to bootstrap this system, I would need to scan the whole database to prefill all the state before I could start reading from a stream of edge-triggered updates. Is that a common pattern?)
I don't know enough about Beam to answer that part of the question, but you can certainly use Flink in the way you've described. The simplest way to accomplish this with Flink is with a streaming join, using the SQL/Table API. The runtime will materialize both tables into managed Flink state, and produce new and/or updated results as new and updated records are processed from the input tables. This is a commonly used pattern.
As for initially bootstrapping the state, before continuing to ingest the updates, I suggest using a CDC-based approach. You might start by looking at https://github.com/ververica/flink-cdc-connectors.
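To make the "both tables are materialized into state" point concrete, here is a toy model in plain Python. It is not Flink code and uses no Flink API; the class and method names are invented for the illustration. Each side keeps every current row, a newer version of a row overwrites the old one, and each incoming row is joined against whatever the other side currently holds.

```python
class StreamingJoin:
    """Keeps both inputs in state and emits joined rows as updates arrive."""

    def __init__(self):
        self.left = {}   # join key -> {row id: latest version of the row}
        self.right = {}

    def on_left(self, join_key, row_id, row):
        # A newer version of the same row id replaces the old one.
        self.left.setdefault(join_key, {})[row_id] = row
        return [(row, r) for r in self.right.get(join_key, {}).values()]

    def on_right(self, join_key, row_id, row):
        self.right.setdefault(join_key, {})[row_id] = row
        return [(l, row) for l in self.left.get(join_key, {}).values()]
```

Nothing is evicted by a window here; only overwritten versions disappear, which matches the "infinitely long window" asked about in the question. A real engine would also emit retractions for the results that an update invalidates, which this sketch leaves out.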

Replicate a database using snapshots and transaction logs

For learning purposes, I want to write my own database that is able to replicate itself. I have made some progress, but now I am facing a problem that I cannot solve. Suppose I have a database (let's call it the source) that I would like to replicate to another database (let's call it the target).
The basic principle is easy: In the source you don't store actual tables, but instead a log of transactions. It's easy to send over the transaction log to the target, where the database then rebuilds itself. If you want to update the target, you simply request the part of the transaction log that has changed ever since. Basically this is what almost every database does.
While this works, it has one major drawback: If a table already exists for a long time, the transaction log is very long, and hence replicating the table requires lots of time…
To avoid this you can store the current state as well. This means you have an up-to-date snapshot that you can copy fast. Additionally, the target has to subscribe to the transaction log of the source. Once it contains additional entries, the target applies them to its copied table. This works well, too, and it's way better in terms of performance and transferred volume.
But now I am facing a problem: Suppose the snapshot is large; then it may happen that changes are made to it while it is being delivered. That means that the copied snapshot contains some old and some new data. Now, how do I get the target database into a consistent state? Even if I know where to start in the transaction log, I either have to apply a change that was already applied to some of the records, or I have to leave it out, but then the change is not applied at all to some other records.
Of course I could use the serializable isolation level, but then performance drops. Of course I could do what e.g. CouchDB does and remember the current table revision in every record, and keep a copy of every record for every revision. But then the required space grows enormously.
So, what shall I do?
Everything that I was able to find on the web relies either on replaying the entire transaction log or on a process like CouchDB's, which takes up huge amounts of space.
Any ideas?
Your snapshot needs to be consistent, and you need to know at which point in the transaction log it is consistent. You then apply any transactions that have been committed since that point.
Obtaining a consistent snapshot can be done with exclusive locking, which may delay other transactions from committing, or using row versions (MVCC).
Good luck with your project.
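A minimal Python sketch of that answer, under the assumption that the snapshot is taken atomically (here simply by copying the table in one call, standing in for a lock or an MVCC read). The class names and the use of a log sequence number (`lsn`) are illustrative choices, not something the question prescribes.

```python
import copy


class Source:
    def __init__(self):
        self.table = {}
        self.log = []  # entries are (lsn, key, value); value None = delete

    def write(self, key, value):
        lsn = len(self.log) + 1
        self.log.append((lsn, key, value))
        if value is None:
            self.table.pop(key, None)
        else:
            self.table[key] = value

    def snapshot(self):
        # Assumed atomic: returns the data and the log position it reflects.
        return len(self.log), copy.deepcopy(self.table)

    def log_since(self, lsn):
        return [entry for entry in self.log if entry[0] > lsn]


class Target:
    def __init__(self, source):
        self.lsn, self.table = source.snapshot()

    def catch_up(self, source):
        # Replay only what was committed after the snapshot's position.
        for lsn, key, value in source.log_since(self.lsn):
            if value is None:
                self.table.pop(key, None)
            else:
                self.table[key] = value
            self.lsn = lsn
```

Because the snapshot carries the log position it corresponds to, no change is applied twice and none is skipped, however long the copy takes to deliver.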

Preventing duplicates with MapReduce to BigQuery pipeline

I was reading Michael's answer to the post below, which suggests using a pipeline to move data from Datastore to Cloud Storage to BigQuery.
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a BigQuery table. That means I need some way of knowing whether entities have already been processed, so that they don't get submitted to BigQuery repeatedly across MapReduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options: I can put a flag on each entity, update it when the entity is processed, and filter flagged entities out on subsequent runs; or I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for other options and to see whether there are any gotchas.
Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are perfect for the type of incremental situation that you've described, because they avoid the overhead of marking entities as having been processed.
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
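The cursor pattern itself is independent of MapReduce and can be sketched generically in Python. Everything here is a stand-in: `fetch_page` imitates a datastore query over an append-ordered list, and the integer cursor replaces the opaque cursor string a real datastore query would return. The point is only that each run resumes from the saved cursor and stores the new one when it finishes.

```python
def fetch_page(entities, cursor, page_size):
    """Stand-in for a datastore query: returns (batch, next_cursor)."""
    batch = entities[cursor:cursor + page_size]
    return batch, cursor + len(batch)


def export_new_entities(entities, saved_cursor, send, page_size=2):
    """Send every entity after saved_cursor; return the cursor to persist."""
    cursor = saved_cursor
    while True:
        batch, next_cursor = fetch_page(entities, cursor, page_size)
        if not batch:
            return cursor  # nothing new: persist this for the next run
        send(batch)
        cursor = next_cursor
```

This assumes entities are only appended and are queried in a stable order (for example by creation time); entities modified after being exported would not be picked up again.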
