Is it possible for Kafka streaming to capture changes in a database view? I have a view in the database with columns from several tables. Will Kafka detect data changes in the view?
Out of the box, no, Kafka doesn't interact with any database.
If you can query a view using JDBC periodically, though, then you can use the Kafka Connect JDBC Source Connector to get those same rows of data as Kafka records.
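As a rough sketch (the connection details, the view name my_view, and its updated_at column are assumptions, with a PostgreSQL URL used purely as an example), such a connector could look like:

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:postgresql://dbhost:5432/mydb",
  "connection.user": "kafka",
  "connection.password": "secret",
  "table.whitelist": "my_view",
  "table.types": "VIEW",
  "mode": "timestamp",
  "timestamp.column.name": "updated_at",
  "topic.prefix": "jdbc-",
  "poll.interval.ms": "10000"
}

Keep in mind that query-based polling only sees rows it can still select, so rows that disappear from the view (deletes in the underlying tables) won't be captured.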
Or you can use a CDC product such as Debezium to stream out each of the individual tables the view uses, and join them within Kafka Streams/KSQL to recreate the materialized view, but backed by a stream.
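A minimal sketch of that join in KSQL/ksqlDB, assuming hypothetical orders and customers topics written by Debezium (with its unwrap/flatten transform applied so each record is just the current row state, and with made-up column names):

-- register the change topics as a stream and a table
CREATE STREAM orders (order_id INT KEY, customer_id INT, amount DOUBLE)
  WITH (KAFKA_TOPIC='dbserver1.public.orders', VALUE_FORMAT='JSON');

CREATE TABLE customers (customer_id INT PRIMARY KEY, name VARCHAR)
  WITH (KAFKA_TOPIC='dbserver1.public.customers', VALUE_FORMAT='JSON');

-- a continuously updated join playing the role of the database view
CREATE STREAM orders_with_customers AS
  SELECT o.order_id, o.amount, c.name
  FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id;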
I need to sync data from several tables in a legacy SQL Server db (source) to a single table in a Postgres db (target). The schema of the source db is absurd, so the query to select the data takes a very long time to run. I'm planning to create an indexed view in the source db, and then somehow sync that indexed view to the Postgres table.
Right now, I simply have a scheduled task that drops the Postgres table (target) and then recreates it from scratch by running the complex query in the source db. This was quick to set up, and it ensures that changes in the source db always eventually make it to the target db, but recreating the table every few hours is (understandably) very slow and expensive. I need a way to replicate ongoing changes (only the new/updated data) from the source view to the target table. Is there a (relatively) simple way to do this?
I'm somewhat familiar with CDC, but I understand that CDC cannot be used on a view, so I don't believe that's an option. Adding "updated at" timestamps to the source tables is not an option, so I can't use that approach. I could add a hash column to the source tables, or maybe add a hash column to the view, so that's an option if that would work. Is there an existing tool/service that does what I need?
If you want to view SQL Server DB data in PostgreSQL, then you can also use tds_fdw:
https://github.com/tds-fdw/tds_fdw
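A rough sketch of that setup on the Postgres side (host, credentials, column list, and the view name are placeholders):

-- install the foreign data wrapper and point it at the SQL Server instance
CREATE EXTENSION tds_fdw;

CREATE SERVER mssql_svr
  FOREIGN DATA WRAPPER tds_fdw
  OPTIONS (servername 'mssql-host', port '1433', database 'LegacyDb');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER mssql_svr
  OPTIONS (username 'etl_user', password 'secret');

-- expose the indexed view as a foreign table queryable from Postgres
CREATE FOREIGN TABLE legacy_view (
  id integer,
  name text
) SERVER mssql_svr OPTIONS (table_name 'dbo.MyIndexedView');

You could then query legacy_view directly from Postgres, or use it as the source for your scheduled refresh instead of pulling the data across by hand.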
Also, there are some third-party tools which could help you achieve your goal, for example SymmetricDS:
http://www.symmetricds.org/about/overview
I am trying to find a tool or methodology to record when an update is done against a specific table and column in AWS Redshift.
In PostgreSQL there is a way of doing this with triggers, but Redshift does not support triggers.
Can we monitor update statements and store the timestamp, the old value, the new value, and the table affected?
There is no in-built capability in Amazon Redshift to do change detection.
Amazon Redshift is intended as a Data Warehouse, which typically means that bulk information is loaded from external sources. It should be relatively rare for data to be updated within Amazon Redshift because it is not intended to be used as an OLTP database.
Thus, it would be better to put change detection in the source database or in the ETL pipeline, rather than Redshift.
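If you do need some change history inside Redshift anyway, one workaround is to have the ETL job compare incoming staging data with the current table and write audit rows before applying the update. This is only a sketch; the audit_log, customers, and customers_staging tables and their columns are assumptions:

-- capture old/new values for rows about to change
INSERT INTO audit_log (table_name, column_name, old_value, new_value, changed_at)
SELECT 'customers', 'email', t.email, s.email, GETDATE()
FROM customers t
JOIN customers_staging s ON s.id = t.id
WHERE t.email <> s.email;

-- then apply the update itself
UPDATE customers
SET email = s.email
FROM customers_staging s
WHERE customers.id = s.id
  AND customers.email <> s.email;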
I'm currently looking into Kafka Connect to stream some of our databases to a data lake. To test out Kafka Connect I've set up a database loaded with one of our project databases. So far so good.
As the next step, I configured Kafka Connect with the following properties:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "timestamp.column.name": "updated_at,created_at",
  "incrementing.column.name": "id",
  "dialect.name": "SqlServerDatabaseDialect",
  "validate.non.null": "false",
  "tasks.max": "1",
  "mode": "timestamp+incrementing",
  "topic.prefix": "mssql-jdbc-",
  "poll.interval.ms": "10000"
}
While this works for the majority of my tables, where I have an ID and a created_at/updated_at field, it won't work for the tables where I resolved my many-to-many relationships with a junction table and a composite key. Note that I'm using the generic JDBC configuration with a JDBC driver from Microsoft.
Is there a way to configure Kafka Connect for these special cases?
Instead of one connector to pull all of your tables, you may need to create multiple ones. This would be the case if you want to use different methods for fetching the data, or different ID/timestamp columns.
As @cricket_007 says, you can use the query option to pull back the results of a query, which could be a SELECT expressing your multi-table join. Even when pulling data from a single table object, the JDBC connector itself is just issuing a SELECT * from the given table, with a WHERE predicate to restrict the rows selected based on the incrementing ID/timestamp.
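For instance, a connector for one of the link tables might look roughly like this (the join SQL, column names, and topic name are placeholders, and it assumes the link table carries an updated_at column):

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "query": "SELECT a.id AS a_id, b.id AS b_id, ab.updated_at FROM a_b ab JOIN a ON a.id = ab.a_id JOIN b ON b.id = ab.b_id",
  "mode": "timestamp",
  "timestamp.column.name": "updated_at",
  "topic.prefix": "mssql-jdbc-a-b",
  "poll.interval.ms": "10000",
  "tasks.max": "1"
}

Note that when query is set, topic.prefix is used as the full topic name, and in incremental modes the connector appends its own WHERE clause for the timestamp/ID filter, so keep the query free of a trailing WHERE (or wrap it in a subquery).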
The alternative is to use log-based change data capture (CDC), and stream all changes directly from the database into Kafka.
Whether you use JDBC or log-based CDC, you can use stream processing to resolve joins in Kafka itself. An example of this is Kafka Streams or KSQL. I've written about the latter a lot here.
You might also find this article useful describing in detail your options for integrating databases with Kafka.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.
From the excellent video "Microservices Evolution: How to break your monolithic database" by Edson Yanaga, I know that there are different ways to split a chunk of data out into a separate database for a microservice:
View
Materialized View
Mirror Table using Trigger
Mirror Table using Transactional Code
Mirror Table using ETL tools
Event Sourcing
Could you please explain the difference between a mirrored table and a materialized view?
I'm confused because both of them are stored on disk...
My understanding is:
Mirrored tables
Mirrored tables are generally an exact copy of another (source) table: same structure and the same data. Some database platforms allow triggers to be created on the source table which apply updates made on the source table to the mirror table. If the database platform does not provide this functionality, or if the use case dictates, you may perform the update in transactional code instead of a trigger.
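As an illustrative sketch in PostgreSQL (the tables and columns are made up), a trigger-maintained mirror might look like:

-- keep mirror_customers in step with customers on every insert/update
CREATE OR REPLACE FUNCTION mirror_customers_fn() RETURNS trigger AS $$
BEGIN
  INSERT INTO mirror_customers (id, name, email)
  VALUES (NEW.id, NEW.name, NEW.email)
  ON CONFLICT (id) DO UPDATE
    SET name = EXCLUDED.name, email = EXCLUDED.email;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mirror_customers_trg
AFTER INSERT OR UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION mirror_customers_fn();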
Materialized Views
A Materialized View contains the result of a query. With a regular database view, when the underlying table data changes, querying the view reflects those changes. However, with a materialized view the data is current only at the point in time of creation (or refresh) of the Materialized view. In simple terms, a materialized view is a snapshot of data at a point in time.
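For example, in PostgreSQL (the names are illustrative):

-- the stored result of the query, written to disk at creation time
CREATE MATERIALIZED VIEW customer_order_totals AS
SELECT c.id, c.name, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;

-- the data is only as fresh as the last refresh
REFRESH MATERIALIZED VIEW customer_order_totals;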
I am designing an ETL process where both the source and target databases are Oracle Standard Edition.
For ETL purposes I need to get the changed data every time. The client does not want any changes to be made to the source objects.
Is it feasible to create a materialized view log on the source database, using a dblink, to track insert/update/delete on the identified tables?
I do not believe so -- a materialized view log must be created in the same database as the source object. If the database link were unavailable, your materialized view log would then be incomplete or inaccurate, or worse yet, would be blocking DML against the source table.
I'd recommend instead either:
Accepting the overhead of a FULL vs. FAST refreshable materialized view; or
Implementing Streams-based replication to have your own copy of the table(s) in question, against which you then implement materialized view logs.
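As a sketch of the first option (the link, refresh schedule, and table names are placeholders):

-- COMPLETE refresh over the database link: no materialized view log needed on the source
CREATE MATERIALIZED VIEW local_copy_of_src
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1/24
AS
  SELECT * FROM source_table@source_db_link;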