How is the changes count calculated in Azure Data Factory - Change Data Capture - SQL Server?

With the announcement of Change Data Capture in ADF come various questions. I tried it hands-on and came across various scenarios.
I implemented multiple tables from source to target, where the source was an on-premises SQL Server and the sink was an Azure SQL Database.
In the Monitor tab I looked at the changes read and written, but I couldn't work out how they are counted for INSERT, UPDATE, and DELETE operations.
If I insert a single row into the source table, the Monitor tab displays 4 changes read.
And when I perform a DELETE operation, that change is neither read nor written.
So, overall, I'm having difficulty understanding how the changes count is calculated. Can anybody explain how this count is worked out?
Please find below a screenshot for reference:
https://i.stack.imgur.com/iLtT5.png

To support upsert/delete operations you need to choose key columns in the column mapping. Can you try selecting these options?
In the Monitoring tab, we currently aggregate all changes read/written across sources and sinks.
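It may also help to look at what SQL Server CDC itself records, keeping in mind that change-table rows don't map one-to-one to source rows (an update, for example, can surface as both a before image and an after image); whether ADF's monitor number maps directly onto these rows isn't documented here. A minimal sketch for inspecting the raw change rows, assuming a hypothetical capture instance named dbo_MyTable:

-- Hedged sketch: dbo_MyTable is a hypothetical capture instance name.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_MyTable');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

-- __$operation: 1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image)
SELECT __$operation, COUNT(*) AS change_rows
FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, 'all update old')
GROUP BY __$operation;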

Related

How do you optimize query performance on Marketplace shared views in Snowflake? Note that incremental materialization probably would not work

I am working with data from the Snowflake Marketplace; these are large, multi-billion-record tables.
I have two conflicting needs: speed and up-to-date data.
I can have up-to-date data by working exclusively with views, meaning the data is up to date from the vendor's perspective at the moment I make a query. However, performance is terrible (the vendor does not cluster their tables the way I would).
I can also materialize copies of the tables with my chosen cluster keys, as sketched below. This works great for performance, but it introduces a 10-20h lag every time the tables are updated, which is not good.
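For reference, a minimal sketch of that materialization step (all database, table, and cluster-key names are hypothetical placeholders):

-- Hedged sketch: every name here is a placeholder.
create or replace table local_copy
  cluster by (trade_date, ticker)   -- my chosen cluster keys
as
select * from shared_db.vendor_schema.big_table;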
My main issue is that this data is "changed" all the time, i.e. current and historical values are updated by the vendor in place (not appended). This makes incremental runs almost impossible.
Does Snowflake have any feature that could help in this context?
You mentioned that it's a table, not a view, that they're sharing. You can request permission from the vendor of this data to place a stream on their table. Once you have a stream on their table, you'll get all the rows you need to synchronize their table changes with your local copy. This should reduce the 10-20h lag, because without a stream on their side you'll wind up doing full refreshes. This approach will allow you to handle incremental changes.
When you try to create a stream on a shared table, unless you've already arranged it with the vendor or the vendor has already enabled this for another share consumer, you may get this message:
SQL access control error: Insufficient privileges to operate on stream
source without CHANGE_TRACKING enabled 'MY_TABLE'
This just means the sharing vendor must enable change tracking on their side. On the sharing account side:
alter table MY_TABLE set change_tracking = true;
As soon as they make that change, any and all sharing consumers will be able to create a stream on the table:
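For example (MY_STREAM is a placeholder name):

create stream MY_STREAM on table MY_TABLE;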
status
--------------------------------------
Stream MY_STREAM successfully created.
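From there, the stream can drive an incremental synchronization instead of a full refresh. A hedged sketch, assuming a hypothetical local copy LOCAL_COPY keyed on an id column:

-- Hedged sketch: LOCAL_COPY, id, and val are hypothetical names.
-- A standard stream represents an update as a DELETE/INSERT pair
-- with METADATA$ISUPDATE = TRUE; consuming the stream in DML advances it.
merge into LOCAL_COPY t
using MY_STREAM s
  on t.id = s.id
when matched and s.METADATA$ACTION = 'DELETE' and not s.METADATA$ISUPDATE
  then delete
when matched and s.METADATA$ACTION = 'INSERT' and s.METADATA$ISUPDATE
  then update set t.val = s.val
when not matched and s.METADATA$ACTION = 'INSERT'
  then insert (id, val) values (s.id, s.val);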

Idiomatic way to do many dynamic filtered views of a Flink table?

I would like to create a per-user view of data tables stored in Flink, constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this is possible.
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state since I could group the current rows behind the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to constantly re-query the state on a periodic basis, which is not ideal for my use case (the per-user state can be large sometimes but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink (a sketch of the fan-out follows below), but if I do that, what's the best pattern for joining those together in the client?
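For that last alternative, the fan-out itself is straightforward in Flink SQL (how best to join the results in the client remains the open question). A hedged sketch, assuming both sink tables have already been declared:

-- Hedged sketch: changelog_sink and rdbms_sink are hypothetical, pre-declared sink tables.
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO changelog_sink SELECT * FROM foo;
  INSERT INTO rdbms_sink     SELECT * FROM foo;
END;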

Can (should) I use Flink like an in-memory database?

I've used batch Beam but am new to the streaming interface. I'm wondering about the appropriateness of using Apache Flink / Beam kind of like an in-memory database -- I'd like to constantly recompute and materialize one specific view of my data based on edge-triggered updates.
More details: I have a few tables in a (normal) database, ranging from thousands to millions of rows, and each one has a many-to-many (M2M) relationship with other ones. Picture to explain:
Hosts <-M2M #1-> Table 1 <-M2M #2-> Table 2 <-M2M #3-> Table 3
Table 1 is a set of objects that the hosts need to know about, and each host needs to know about all downstream rows referenced directly or indirectly by the objects in Table 1 that it's related to. When changes happen anywhere other than the first many-to-many relationship M2M #1, it's not obvious which hosts need to be updated without traversing "left" to find the hosts and then traversing "right" to get all the necessary configuration. The objects and relationships at most levels change frequently, and I need sub-second latency to go from "a record or relationship changed" to recalculating any flattened config files with changes in them so that I can push updates to the hosts very quickly.
Is this an appropriate use case for streaming Flink / Beam? I have worked with Beam in a different system, but only in batch mode, and I think it would be a great tool to use here if I could edge-trigger it. The part I'm getting stuck on is that, in batch mode, the PCollections are all "complete", in the sense that I can always join all records in one table with all records in another table. But with streaming, once a record has been processed, it gets removed from its PCollection and can't be joined against future updates that arrive later and relate to it, right? IIUC, it's only available within a window, but I effectively want an infinitely long window in which only outdated versions of items in a PCollection (e.g. versions that have been overwritten by a new version that came in over the stream) would be freed up.
(Also, to bootstrap this system, I would need to scan the whole database to prefill all the state before I could start reading from a stream of edge-triggered updates. Is that a common pattern?)
I don't know enough about Beam to answer that part of the question, but you can certainly use Flink in the way you've described. The simplest way to accomplish this with Flink is with a streaming join, using the SQL/Table API. The runtime will materialize both tables into managed Flink state, and produce new and/or updated results as new and updated records are processed from the input tables. This is a commonly used pattern.
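For illustration, a hedged Flink SQL sketch of such a streaming join; the table and column names are hypothetical stand-ins for the hosts/M2M chain in the question:

-- Hedged sketch: all names are placeholders for the hosts/M2M/table chain.
-- A regular (unbounded) join keeps both sides in managed state and emits
-- new or updated results as rows arrive or change on any input.
SELECT h.host_id, t3.payload
FROM hosts   AS h
JOIN m2m_1   AS m1 ON m1.host_id = h.host_id
JOIN table_1 AS t1 ON t1.id      = m1.t1_id
JOIN m2m_2   AS m2 ON m2.t1_id   = t1.id
JOIN table_2 AS t2 ON t2.id      = m2.t2_id
JOIN m2m_3   AS m3 ON m3.t2_id   = t2.id
JOIN table_3 AS t3 ON t3.id      = m3.t3_id;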
As for initially bootstrapping the state, before continuing to ingest the updates, I suggest using a CDC-based approach. You might start by looking at https://github.com/ververica/flink-cdc-connectors.

Querying a huge database using cfquery

Well, I am going to query 4 GB of data using a cfquery. It's going to be painful to query the whole database, as it will take a very long time to get the data back.
I tried a stored procedure when the data was 2 GB, and it wasn't really fast then either.
The data will be pulled based on the date range the user selects on an HTML page.
It has been suggested that I archive data in order to speed up querying the database.
Do you think I'll have to create a separate table with only the fields that are required, and then query this newly created table?
Well, the current table is 4 GB, but it is growing day by day; basically, it's a response database (it stores information coming from somewhere else). After doing some research, I am wondering if writing a trigger could be an option. If I do this, then as soon as a new row is added to the current 4 GB table, the trigger will run a SQL query that transfers the contents of the required fields into the newly created table (a sketch follows below). This will keep happening as long as new values arrive in my original 4 GB database.
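A minimal sketch of such a trigger, assuming hypothetical table and column names:

-- Hedged sketch: Responses, ResponsesReporting, and the columns are hypothetical.
CREATE TRIGGER trg_copy_response
ON dbo.Responses
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- "inserted" holds the row(s) added by the triggering statement
    INSERT INTO dbo.ResponsesReporting (ResponseId, ResponseDate, RequiredField)
    SELECT ResponseId, ResponseDate, RequiredField
    FROM inserted;
END;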
Does the above approach sound good enough to tackle my problem? One more concern: even though I am copying only the required fields into the new table, at some point the new table will also grow, and that could also slow down queries against it.
Please correct me if I am wrong somewhere.
Thanks
More Information:
I am using SQL Server. Indexing is currently done but it's not effective.
Archiving the data will be farting against thunder. The data has to travel from your database to your application. Then your application has to process it to build the chart. The more data you have, the longer that will take.
If it is really necessary to chart that much data, you might want to simply acknowledge that your app will be slow and do things to deal with it. This includes code to prevent multiple page requests, displays to the user, and such.

Versioning a dataset in an RDBMS using initials and deltas

I'm working on a system that mirrors remote datasets using initials and deltas. When an initial comes in, it mass deletes anything preexisting and mass inserts the fresh data. When a delta comes in, the system does a bunch of work to translate it into updates, inserts, and deletes. Initials and deltas are processed inside long transactions to maintain data integrity.
Unfortunately the current solution isn't scaling very well. The transactions are so large and long running that our RDBMS bogs down with various contention problems. Also, there isn't a good audit trail for how the deltas are applied, making it difficult to troubleshoot issues causing the local and remote versions of the dataset to get out of sync.
One idea is to not run the initials and deltas in transactions at all, and instead to attach a version number to each record indicating which delta or initial it came from. Once an initial or delta is successfully loaded, the application can be alerted that a new version of the dataset is available.
This just leaves the issue of how exactly to compose a view of a dataset up to a given version from the initial and deltas. (Apple's Time Machine does something similar, using hard links on the file system to create a "view" of a certain point in time.)
Does anyone have experience solving this kind of problem or implementing this particular solution?
Thanks!
Have one writer and several reader databases. You send writes to the one database and have it propagate the exact same changes to all the other databases. The reader databases will be eventually consistent, and the time to update is very fast. I have seen this done in environments that get upwards of 1M page views per day. It is very scalable. You can even put a hardware router in front of all the read databases to load-balance them.
Thanks to those who tried.
For anyone else who ends up here, I'm benchmarking a solution that adds a "dataset_version_id" and "dataset_version_verb" column to each table in question. A correlated subquery inside a stored procedure is then used to retrieve the current dataset_version_id when retrieving specific records. If the latest version of a record has a dataset_version_verb of "delete", it's filtered out of the results by a WHERE clause.
This approach has shown an average ~80% performance hit so far, which may be acceptable for our purposes.
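For illustration, a hedged sketch of that query shape; the table and key names are hypothetical, and @requested_version stands in for the stored procedure's parameter:

-- Hedged sketch: items and item_key are hypothetical names.
SELECT i.*
FROM items AS i
WHERE i.dataset_version_id = (
        SELECT MAX(i2.dataset_version_id)       -- latest version of this record...
        FROM items AS i2
        WHERE i2.item_key = i.item_key
          AND i2.dataset_version_id <= @requested_version  -- ...as of the requested dataset version
      )
  AND i.dataset_version_verb <> 'delete';       -- hide records whose latest verb is delete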
