Kafka to ES and other sink databases

Using a MySQL source connector, I can capture MySQL changes and post them to ES or another DB for backup. But for that I need a separate connector (both source and sink) for every table I have in my source DB.
So my question is:
How can I achieve the same purpose without creating that many source and sink connectors, one pair per table? Creating so many connectors is cumbersome, and avoiding it would make backing up the DB (replica) and building a faster response service for clients much easier for me. Or is there no way to do this?

For the source connectors, you can use table.whitelist. For example:
    table.whitelist=User,Address,Email
Sink connectors can only be configured for one table at a time.
And I wouldn't say that it is hard to maintain multiple sink/source connectors and topics. In my experience, it is harder to maintain connectors that replicate data from multiple topics/sources. For example, if you want to apply an SMT (Single Message Transform) to one particular topic, you can't do that without isolated connectors, because SMTs are applied at the connector level. Furthermore, if you configure a single connector for all of your sources and it fails at some point, all of your target systems will suffer downtime.
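For illustration, a single JDBC source connector covering several tables might look roughly like this (a sketch only: the connection details, column names and the RegexRouter transform are made-up examples, and a CDC connector such as Debezium's MySQL connector has an equivalent table include/whitelist option):

    name=mysql-multi-table-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:mysql://localhost:3306/mydb
    connection.user=dbuser
    connection.password=dbpass
    mode=timestamp+incrementing
    timestamp.column.name=updated_at
    incrementing.column.name=id
    table.whitelist=User,Address,Email
    topic.prefix=mysql-
    # Any SMT configured here applies to every table/topic this connector produces
    transforms=route
    transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
    transforms.route.regex=(.*)
    transforms.route.replacement=backup-$1

The trade-off is exactly the one described above: the transform and any failure now affect all three tables at once.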

Related

Why not write data in Hudi or Iceberg format in flink-table-store?

Recently I had a chance to get to know the flink-table-store project. I was attracted by the idea behind it at first glance.
After reading the docs, a question has been on my mind for a while. It's about the design of the file storage.
It looks like it could be implemented on top of other popular open-source libraries rather than creating a totally new component (LSM-tree based). Hudi or Iceberg looks like a good choice, since they both support saving and querying change logs.
If it were done that way, there would be no need to create a component for each related computation engine (Spark, Hive or Trino), since they are already supported by Hudi and Iceberg. That looks like a better solution to me than inventing another wheel.
So here are my questions: Is there any issue writing data as Hudi or Iceberg? Why weren't they chosen in the first design decision?
I'm looking for an explanation of the design.
Flink Table Store is a new project created to natively support update/delete operations on DFS tables using data snapshots.
These features are already available in Apache Hudi (the first open lakehouse format), Delta Lake (the lakehouse format developed and maintained by Databricks) and Apache Iceberg, which is evolving quickly.
Tables created with these tools can be queried from different tools/engines (Spark, Flink, Trino, Athena, Spectrum, Dremio, ...), but to support all of these tools they make design compromises that can affect performance, while Flink Table Store is created and optimized for Flink, so it gives you the best performance with Apache Flink compared with the other three projects.
Is there any issue writing data as Hudi or Iceberg?
Not at all. A lot of companies use Hudi and Iceberg with Spark, Flink and Trino in production, and they have no issues.
Why not choose them in the first design decision?
If you want to create tables readable by other tools, you should avoid using Flink Table Store and choose between the other options. But the main idea of Flink Table Store was to create internal tables used to transform your streaming data, similar to KTables in Kafka Streams: you write your streaming data to Flink Table Store tables, transform it in multiple stages, and at the end write the result to a Hudi or Iceberg table to query it with the different tools.

Is there any way in Spring Boot to commit the same data to two different data sources without duplicating the repository?

I want to replicate Neo4j data on two different Neo4j instances installed on EC2. Is there any way I can commit the same data to two different Neo4j instances?
I tried the examples for committing to two different data sources given here, but they create separate repositories for each data source config, which I don't want. I have only one repository, and when I commit, the data should be written to both data sources. I cannot use the Enterprise Edition of Neo4j since it is costly, so I have to limit myself to the Community Edition.
In general, I would like to learn about this process irrespective of the type of database. It could be SQL, H2, Mongo, or any other DB.
Essentially you are asking for a replication feature, and as far as I remember replication is not available in the free open-source version of Neo4j.
If you create two repositories, you will introduce the problems of a distributed system, for example: what if one write succeeds and the other one fails? If you really want this kind of architecture, you are better off using RabbitMQ/Kafka and making it an event-driven system built on the publish/subscribe pattern. That way you can have multiple listeners updating multiple Neo4j instances. Still not ideal by any means!
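As a rough sketch of that publish/subscribe idea (the topic name, Cypher statement, hosts and credentials are all invented; it uses the official Neo4j Java driver and Spring Kafka, and for brevity a single listener writes to both instances rather than one consumer group per instance):

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Session;
    import org.neo4j.driver.Values;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class PersonReplicator {

        // Two community-edition instances; the bolt addresses are placeholders
        private final Driver primary =
                GraphDatabase.driver("bolt://ec2-host-1:7687", AuthTokens.basic("neo4j", "secret"));
        private final Driver replica =
                GraphDatabase.driver("bolt://ec2-host-2:7687", AuthTokens.basic("neo4j", "secret"));

        // The service publishes an event once; this listener applies it to both databases
        @KafkaListener(topics = "person-events", groupId = "neo4j-writer")
        public void onPersonEvent(String name) {
            String cypher = "MERGE (p:Person {name: $name})";
            write(primary, cypher, name);
            write(replica, cypher, name);  // if this fails, the instances drift until you replay/reconcile
        }

        private void write(Driver driver, String cypher, String name) {
            try (Session session = driver.session()) {
                session.run(cypher, Values.parameters("name", name));
            }
        }
    }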
I would suggest looking at Neo4j alternatives like https://orientdb.com/ or buying a commercial license of Neo4j.

Should I abstract technical details from Kafka ETL?

What's the best practice for attaching an ERP (or any other “standard software” that might come with a complicated, technical data model) as a source to a Kafka system?
Would you recommend hiding these technical details using views in the source database? I'm not sure whether a view can “trigger” the replication inside a JDBC source connector for Oracle.
Another way I can think of is “joining” all these tables in Kafka using its Streams API. That way, the source database system is not used to perform “ETL joins”.
You can start by creating (assuming one doesn't exist already) a Kafka Connect source connector for your ERP software. In it, try to hide the complexity of dealing with the technicalities of the ERP. You can also do some simple stateless conversions to abstract away from the specifics of the ERP model into your domain model. My experience tells me this won't be sufficient and the output will still be somewhat bound to the source model if it's complicated. Please remember to partition the data in a way that makes sense in your domain.
Next, once you have the data in Kafka and partitioned well, you can do further processing in Kafka Streams. It has fault-tolerant (with Kafka v0.11 even transactional) stateful processing capabilities and makes joins between topics easy.
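For example, a minimal Kafka Streams job that joins two ERP-derived topics into a domain-level topic could look like this (a sketch: the topic names, String serdes and the join logic are illustrative, and both input topics are assumed to be co-partitioned on the same key, e.g. a customer id):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    import java.util.Properties;

    public class ErpJoin {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Changelog topics written by the ERP source connector, keyed by customer id
            KTable<String, String> orders =
                    builder.table("erp-orders", Consumed.with(Serdes.String(), Serdes.String()));
            KTable<String, String> customers =
                    builder.table("erp-customers", Consumed.with(Serdes.String(), Serdes.String()));

            // Join the two technical tables into one domain-level view and publish it
            orders.join(customers, (order, customer) -> order + "|" + customer)
                  .toStream()
                  .to("domain-orders", Produced.with(Serdes.String(), Serdes.String()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "erp-etl");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            new KafkaStreams(builder.build(), props).start();
        }
    }

This keeps the “ETL joins” inside Kafka rather than in the source database.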

Synchronising data entities from different applications

I'm looking for some feedback on the best approach to a problem I've been tasked with. There are two systems with their own databases which store very similar business entities.
For each entity in question there needs to be a synchronization mechanism in place to make sure that changes in one database are delivered to the other when a change occurs, and for the changes to be translated into the destination table structure. This translation means that replication is not an option, but I don't want to start writing bespoke triggers or views etc. to keep them in sync.
Is this something which BizTalk or a similar product could handle after an initial configuration/mapping process? Also, is BizTalk potentially overkill, and are there any other methods I could employ to achieve this?
Thanks,
Brian.
It depends on the size of the "systems" (tables?) to synchronise.
EAI tools are the general kind of application for doing this: connecting two systems that can't interact directly, effectively mapping one business object to another and applying a map to translate one into the other.
But such tools (like webMethods, for example) are enterprise tools; if you only need to synchronise two tables from two systems, an EAI will clearly be overkill.
Anyway, the principles can help you. The EAI approach would be to have a generic business object that matches all of the properties found in both systems for the business objects you want to synchronise. Then you will need some sort of map to translate each application-specific business object to and from your generic business object. Your object should not only describe the business data, but also the operation to perform (create, update, delete data).
Then you need a trigger (or two if you want to synchronise both ways) to detect when a change happens, and use the map to transform the data the trigger receives into the generic object (with the operation to perform at the other end).
And finally you need an "updater" that will take the specific business object and perform the right operation in the database (insert/update/delete).
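A bare-bones sketch of that shape in Java (all of the type and method names here are invented for illustration; a real EAI product or integration framework would supply the trigger and updater plumbing):

    import java.util.Map;

    enum Operation { CREATE, UPDATE, DELETE }

    // The generic business object: the union of both systems' properties plus the operation
    record GenericCustomer(Operation operation, Map<String, Object> properties) {}

    // One mapper per application-specific model
    interface Mapper<T> {
        GenericCustomer toGeneric(T source, Operation op);
        T fromGeneric(GenericCustomer generic);
    }

    // The "updater" applies a change to a destination database (insert/update/delete)
    interface Updater {
        void apply(GenericCustomer change);
    }

    // A trigger (database trigger, CDC hook, polling job, ...) would call something like this
    class SyncPipeline<T> {
        private final Mapper<T> sourceMapper;
        private final Updater destinationUpdater;

        SyncPipeline(Mapper<T> sourceMapper, Updater destinationUpdater) {
            this.sourceMapper = sourceMapper;
            this.destinationUpdater = destinationUpdater;
        }

        void onChange(T changedRow, Operation op) {
            destinationUpdater.apply(sourceMapper.toGeneric(changedRow, op));
        }
    }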
The EAI provides connectors to take care of triggering the workflow and updating the database. You will still need to define some mappings in a specific way depending on the EAI used.
EAI tools are a lot more powerful than just synchronising two tables. Connectors come in various types and can interact with various systems (including proprietary ones), various databases, simple formats (XML, text) and specific protocols (FTP, web services, etc.).
An EAI also ensures that any modification is effectively committed at the end.
Hope it helps.
SQL Server Integration Services could be a cheap candidate for solving the problem (it can connect to DBs and data sources other than SQL Server). SSIS is part of all SQL Server installations (with the exception of Express).
There is a nifty tool called "datariver" by the Swiss company Sowatec (where I worked a few years ago; I wasn't involved with this product though, just so you know). It's meant to flow data from sources to sinks (just like a river).
The website is in German, but the guys behind it are happy to answer any of your questions in English by mail.
BizTalk would be an ideal solution for this kind of problem.
What can BizTalk do?
1. Define a schema which represents the common business entity; essentially all the fields which need to be kept in sync across the several database tables.
2. Define the flow of communication (orchestrations) and the end-points (web services), i.e. which update triggers what changes.
3. Use maps to map the common business entity onto the specific data elements required by the databases. Note that BizTalk has built-in adapters to speed up the development process.
If adequate time is spent on the design of this system, the results will be fabulous.
For development purposes, refer to my articles (Google keywords: BizTalk + Karamchetti).

What is the best approach for decoupled database design in terms of data sharing?

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links: by setting up a few database links I can get data from A to B with a minimum of fuss. The problem for me is that you end up with a tightly coupled design, and if one database goes down it can bring the coupled databases down with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much of replication, more of accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application), then for such a small dataset igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset, the option of caching locally starts to become trickier. What options would you go for in those circumstances? I wondered about an XML service, but tuinstoel (in a comment on le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db's answer is closest, which is why I've accepted it, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle BASIC replication (as opposed to ADVANCED replication) is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, a messaging solution seems the most appropriate Oracle approach. Oracle Advanced Queuing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be rolling your own XML web services, to be used only where the relative ease of Advanced Queuing isn't an option.
Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...
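Concretely, the MV-over-a-link setup is roughly the following (the table, MV and database link names are illustrative; the materialized view log on the master side is what allows fast, change-only refreshes):

    -- On the master database: log changes so only deltas are shipped
    CREATE MATERIALIZED VIEW LOG ON currency_rates;

    -- On the consuming database: a local, incrementally refreshed copy over a database link
    CREATE MATERIALIZED VIEW currency_rates_mv
      REFRESH FAST ON DEMAND
      AS SELECT * FROM currency_rates@master_db;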
It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.
The OP has provided more information. He states that the dataset is very large. Well how large is large? And how often are the master tables changed?
With the use of materialized view logs, Oracle will only propagate the changes made to the master table; a complete refresh of the data isn't necessary. Oracle Streams also only communicates the modifications to the other side.
Buying storage is cheap, so why not cache locally? It's much cheaper than programming your own solution.
An XML service doesn't help you when its database is not available, so I don't understand how it would help. Oracle has many options for replication; explore them.
edit
I've built XML services. They provide interoperability between different systems with a clear interface (contract). You can build an XML service in C# and consume the service from Java. However, XML services are not fast.
Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances when it's already there? You can have propagation move messages from one instance to another when both are up, and process them as needed on the destination servers. AQ is really rather simple to set up and use.
Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.
What kind of immediacy do you need, and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts that run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.
