Should I abstract technical details from Kafka ETL?

What’s the best practice for attaching an ERP (or any other “standard software” that might come with a complicated, technical data model) as a source to a Kafka system?
Would you recommend hiding these technical details using views in that source database? I’m not sure whether a view can “trigger” the replication inside a JDBC source connector for Oracle.
Another way I can think of is “joining” all these tables in Kafka using its Streams API. This way, the source database system is not used to perform “ETL joins”.

You can start by creating (assuming one doesn't exist already) a Kafka Connect source connector for your ERP software. In it, try to hide the complexity of dealing with the technicalities of the ERP. You can also do some simple stateless conversions to abstract away from the specifics of the ERP model into your domain model. My experience tells me this won't be sufficient and the output will still be somewhat bound to the source model if it's complicated. Please remember to partition the data in a way that makes sense in your domain.
Next, once you have the data in Kafka and partitioned well, you can do further processing in Kafka Streams. Kafka Streams offers fault-tolerant (with Kafka 0.11, even transactional) stateful processing and makes joins between topics easy.
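To make the second step concrete, here is a minimal Kafka Streams sketch of such a join. The topic names ("erp-orders", "erp-customers", "domain-orders-enriched"), the String serdes and the application id are hypothetical placeholders, and both input topics are assumed to be keyed (co-partitioned) by the same customer id:

    // Minimal sketch: enrich each ERP order with its customer record (names are placeholders).
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class ErpJoinExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "erp-order-enricher");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> orders = builder.stream("erp-orders");      // keyed by customer id
            KTable<String, String> customers = builder.table("erp-customers");  // latest record per customer

            // Join each order with the customer it belongs to and publish the
            // denormalised result to a domain-level topic.
            orders.join(customers, (order, customer) -> order + "|" + customer)
                  .to("domain-orders-enriched");

            new KafkaStreams(builder.build(), props).start();
        }
    }

The join only works cleanly if both topics are partitioned by the same key, which is why partitioning the data sensibly in the source connector matters.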

Related

Kafka to ES and other sink DB

Using a MySQL source connector, I can capture the MySQL changes and post them to ES or another DB for backup. But for that I need a separate connector (both source and sink) for each table I have in my source DB.
So my question is:
How can I achieve the same purpose without creating that many source and sink connectors, one per table? Creating that many connectors is cumbersome. Avoiding it would make backing up the DB (as a replica) and building a faster response service for clients much easier for me. Or is there no way to do this?
For the source connectors, you can use table.whitelist. For example,
table.whitelist: "User, Address, Email"
Sink connectors can only be configured for one table at a time.
And I wouldn't say that it is hard to maintain multiple sink/source connectors and topics. From my experience, it is harder to maintain connectors which replicate data from multiple topics/sources. For example, if you want to apply an SMT (Single Message Transform) to a particular topic, you won't be able to do so unless you have isolated connectors, because SMTs are applied at the connector level. Furthermore, if you configure a single connector for all of your sources and at some point it fails, all of your target systems will encounter downtime.
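As a hedged illustration of what such an isolated connector looks like, here is a JDBC source connector configuration built as a plain Java map (for example, to POST to the Connect REST API). The connection details, topic prefix and the RegexRouter transform are assumptions for illustration, not taken from the question:

    // Sketch of one isolated source connector: a table whitelist plus a transform
    // that only affects the topics this connector produces (all values are placeholders).
    import java.util.HashMap;
    import java.util.Map;

    public class SourceConnectorConfigExample {
        public static Map<String, String> userTablesConnector() {
            Map<String, String> config = new HashMap<>();
            config.put("connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector");
            config.put("connection.url", "jdbc:mysql://localhost:3306/sourcedb");
            config.put("table.whitelist", "User,Address,Email");
            config.put("mode", "incrementing");
            config.put("incrementing.column.name", "id");
            config.put("topic.prefix", "mysql-");
            // SMTs are configured per connector, so this routing only applies here.
            config.put("transforms", "route");
            config.put("transforms.route.type", "org.apache.kafka.connect.transforms.RegexRouter");
            config.put("transforms.route.regex", "mysql-(.*)");
            config.put("transforms.route.replacement", "backup-$1");
            return config;
        }
    }

Keeping one such configuration per group of tables (or per topic on the sink side) is what makes it possible to apply different transforms, and to restart or fail each pipeline independently.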

AnyLogic 7: database modelling

I am looking for the best way to model a database system.
It should be made of nodes, edges and data query flows.
I know there is a flow library, but I'm not sure whether it is usable for such things.
So, the question is: are there any libraries that I could use for this purpose? Or should I mostly use my own types, agents, etc.?
The fluid library (if that's what you meant) is not useful for that purpose.
If you want to model the flow of data through a system of nodes, you might want to start with a simple process-modelling approach where data items are agents flowing through queues, delays and service objects...
However, depending on what your database system is doing (I am no expert there), you might actually need to switch to a pure agent-based approach sooner or later (i.e. replace the process library objects with your own functionality).
In short: start with process modelling and introduce agent-based functionality over time...
If you are new to AnyLogic I suggest you follow the logic in the tutorial for agent based modeling. Look at it as if the distributor is your server, the retailers your clients and the orders your queries. You can use GIS maps if you are concerned about the real location of servers and clients or use other network capabilities (or agent connections) if the actual locations are not important in your model.

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know some best practices for keeping a single data source responsive, or, in case you have a copy of the data on every server, how best to synchronise those databases to keep them updated?
There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMSs, and it normally relies on features provided by the RDBMS. Replication has a latency, which means that although servers can handle load independently, they may not necessarily be reading the latest data. This may or may not be a problem for a particular system. When replication is bidirectional, simultaneous changes in two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (e.g. audit log => append both) or difficult (e.g. hotel room booking - cancel one? select an alternative hotel?). You also have to consider what to do in the event that the replication network link is down (i.e. do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later). This all depends on the exact type of data you have. One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading, and send all write operations to the source database. This is always a trade-off between Availability and Consistency (see CAP Theorem). The advantage of an RDBMS with replication is that you can easily query your entire dataset in complex ways and have a greater opportunity to remove duplication by using relational links between data items.
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or as Martin Fowler calls them, 'Aggregate-Oriented Databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also hard to manage the data schema, as it lives independently on multiple databases, which is why most NoSQL databases are schema-less.
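For illustration, here is a minimal sketch of the routing side of sharding, assuming a hypothetical list of shard connection strings and using the customer id as the shard key (the names and the simple hash-based placement are placeholders, not a prescribed design):

    import java.util.List;

    public class ShardRouter {
        private final List<String> shardJdbcUrls;

        public ShardRouter(List<String> shardJdbcUrls) {
            this.shardJdbcUrls = shardJdbcUrls;
        }

        /** All data for one customer (customer -> orders) lives in the same shard. */
        public String shardFor(String customerId) {
            int index = Math.floorMod(customerId.hashCode(), shardJdbcUrls.size());
            return shardJdbcUrls.get(index);
        }
    }

A query that spans all customers then has to fan out to every shard and merge the results, which is exactly the extra work described above.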
Database-per-service
In the microservice approach, it is advised that each microservice should have its own dedicated database, which is not allowed to be accessed by any other microservice (of a different type). Hence, a microservice that manages customer contact information stores the data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique IDs, or URIs (especially if the microservices are RESTful), etc. The downside, again, is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice API, not directly to the databases).
Polyglot storage
So many of my projects in the past have involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model; much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, and HTML templates in a NoSQL database. The trend with microservices is to move towards a model where different parts of your dataset are placed in storage providers chosen according to need.
If you are thinking of keeping a separate copy of the database for each microservice and you want to achieve eventual consistency, you can use Kafka Connect. Briefly, Kafka Connect will watch your DBs and, whenever there are changes, it will read the log file and add the logged events as messages to a queue; the other databases that subscribe to this queue can then execute the same statements on their side as well.
Kafka Connect isn't the only framework; you can find other frameworks or applications for the same purpose.
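A rough sketch of the subscriber side described above might look like the following. The topic name, consumer group, replica connection string and the assumption that each record value is a replayable SQL statement are all hypothetical simplifications (real CDC tools such as Debezium emit structured change events rather than raw SQL):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ChangeEventApplier {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "orders-replica");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Connection replica = DriverManager.getConnection("jdbc:postgresql://localhost/replica")) {
                consumer.subscribe(Collections.singletonList("orders-changes"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        try (Statement stmt = replica.createStatement()) {
                            stmt.execute(record.value()); // replay the change on the local copy
                        }
                    }
                }
            }
        }
    }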

Synchronising data entities from different applications

I'm looking for some feedback on the best approach to a problem I've been tasked with. There are two systems with their own databases which store very similar business entities.
For each entity in question there needs to be a synchronization mechanism in place to make sure that changes in one database are delivered to the other when a change occurs and for the changes to be translated into the destination table structure. This translation means that replication is not an option but I don't want to start writing bespoke triggers or views etc to keep them in sync.
Is this something which BizTalk or a similar product could handle after an initial configuration / mapping process? Also, is BizTalk potentially overkill, and are there any other methods I could employ to achieve this?
Thanks,
Brian.
It depends on the size of the "systems" (tables?) to synchronise.
EAI (Enterprise Application Integration) tools are the general-purpose applications for this: connecting two systems which can't interact directly, effectively mapping one business object to another by applying a map to translate one into the other.
But such tools (like webMethods, for example) are enterprise tools; if you only need to synchronise two tables from two systems, EAI will clearly be overkill.
Anyway, the principles can help you. The EAI approach would be to have a generic business object that matches all the properties found in both systems for the business objects you want to synchronise. Then you will need some sort of map to translate each application-specific business object to and from your generic business object. Your object should describe not only the business data, but also the operation to perform (create, update or delete data).
Then you need a trigger (or two, if you want to synchronise both ways) to detect when a change happens, and use the map to transform the data the trigger captures into the generic object (with the operation to perform at the other end).
And finally you need an "updater" that takes the specific business object and performs the right operation in the database (insert/update/delete).
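A rough sketch (with hypothetical names) of the generic business object, the mapper and the updater described above:

    import java.util.Map;

    public class GenericSyncExample {

        enum Operation { CREATE, UPDATE, DELETE }

        /** The generic business object: the shared data plus the operation to perform. */
        static class GenericCustomer {
            Operation operation;
            String key;                  // identifier shared by both systems
            Map<String, Object> fields;  // union of the properties found in both systems
        }

        /** Translates between one application's specific model and the generic object. */
        interface CustomerMapper {
            GenericCustomer toGeneric(Object applicationSpecificCustomer);
            Object fromGeneric(GenericCustomer generic);
        }

        /** The "updater": applies a generic change to the destination database. */
        interface CustomerUpdater {
            void apply(GenericCustomer change); // insert/update/delete as appropriate
        }
    }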
EAI tools provide connectors to take care of triggering the workflow and updating the database. You will still need to define the mappings in a specific way depending on the EAI used.
EAI tools are a lot more powerful than just synchronising two tables. Connectors come in various types and can interact with various systems (including proprietary ones), various databases, simple formats (XML, text) or specific protocols (FTP, web services, etc.).
EAI also ensures that any modification is effectively committed at the end.
Hope it helps.
SQL Server Integration Services (SSIS) could be a cheap candidate for solving the problem (it can connect to DBs and data sources other than SQL Server). SSIS is part of all SQL Server installations (with the exception of Express).
There is a nifty tool called "datariver" by the Swiss company Sowatec (where I worked a few years ago; I wasn't involved with this product though, just so you know). It's meant to flow data from sources to sinks (just like a river).
The web site is in German but the guys behind it are happy to answer any of your questions in English by mail.
BizTalk would be an ideal solution for this kind of problem.
What can BizTalk do?
1. Define a schema which represents a common business entity; this is essentially all the fields which need to be kept in sync across the various database tables.
2. Define the flow of communication (orchestrations) and end-points (web services), i.e. which update triggers what changes.
3. Use maps to map the common business entity into the specific data elements required by the databases. Note that BizTalk has built-in adapters to speed up the development process.
If adequate time is spent on the design of this system, the results will be fabulous.
For development purposes, refer to my articles (Google keywords: BizTalk + Karamchetti).

What is the best approach for decoupled database design in terms of data sharing?

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links - setting up a few database links I can get data from A to B with the minimum of fuss. The problem for me is that you end up with a tightly-coupled design and if one database goes down it can bring the coupled databases with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much about replication, more about accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application). For such a small dataset, igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset, the option of locally caching starts to become trickier. What options would you go for in these circumstances? I wondered about an XML service, but tuinstoel (in a comment to le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db is closest, which is why I've accepted that answer, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle BASIC replication (as opposed to ADVANCED replication) is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, then a messaging solution seems the most appropriate Oracle solution. Oracle Advanced Queueing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be roll-your-own XML web services but only where the relative ease of Advanced Queueing isn't an option.
Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...
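A hedged sketch of the materialized-view-over-a-database-link idea, issued through JDBC here purely for illustration; the link name (b_link), table, credentials and refresh interval are assumptions, and the fast refresh additionally assumes a materialized view log exists on the master table in 'B':

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MaterializedViewSetup {
        public static void main(String[] args) throws Exception {
            try (Connection a = DriverManager.getConnection(
                     "jdbc:oracle:thin:@localhost:1521:A", "app", "secret");
                 Statement stmt = a.createStatement()) {
                // Local snapshot in database 'A' of the rates table in database 'B'.
                // If 'B' goes down the refresh fails, but the last copied data stays readable.
                stmt.execute(
                    "CREATE MATERIALIZED VIEW rates_mv " +
                    "REFRESH FAST START WITH SYSDATE NEXT SYSDATE + 1/24 " +
                    "AS SELECT * FROM rates@b_link");
            }
        }
    }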
It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.
The OP has provided more information. He states that the dataset is very large. Well how large is large? And how often are the master tables changed?
With the use of materialized view logs, Oracle will only propagate the changes made to the master table. A complete refresh of the data isn't necessary. Oracle Streams also only communicates the modifications to the other side.
Buying storage is cheap, so why not local caching? Much cheaper than programming your own solutions.
An XML service doesn't help you when its database is not available, so I don't understand how it would help. Oracle has many options for replication; explore them.
Edit:
I've built XML services. They provide interoperability between different systems with a clear interface (contract). You can build an XML service in C# and consume the service from Java. However, XML services are not fast.
Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances? It's already there. You can have propagation move messages from one instance to another when they are both up. You can process them as needed on the destination servers. AQ is really rather simple to set up and use.
Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.
What kind of immediacy do you need and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.
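A minimal sketch of such a scheduled pull; the connection strings, table names and MERGE statement are hypothetical and would need tailoring to the real schema:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class RatesPullJob {
        public static void main(String[] args) throws Exception {
            try (Connection source = DriverManager.getConnection("jdbc:oracle:thin:@master:1521:SRC", "ro_user", "secret");
                 Connection target = DriverManager.getConnection("jdbc:oracle:thin:@client:1521:APP", "app", "secret");
                 Statement read = source.createStatement();
                 ResultSet rs = read.executeQuery("SELECT currency_code, rate FROM conversion_rates");
                 PreparedStatement write = target.prepareStatement(
                     "MERGE INTO local_rates t USING dual ON (t.currency_code = ?) " +
                     "WHEN MATCHED THEN UPDATE SET t.rate = ? " +
                     "WHEN NOT MATCHED THEN INSERT (currency_code, rate) VALUES (?, ?)")) {
                while (rs.next()) {
                    write.setString(1, rs.getString("currency_code"));
                    write.setDouble(2, rs.getDouble("rate"));
                    write.setString(3, rs.getString("currency_code"));
                    write.setDouble(4, rs.getDouble("rate"));
                    write.executeUpdate();
                }
            }
        }
    }

Run on a schedule (cron, a scheduler job, etc.), this keeps the client copy close enough to the master for data that can tolerate a little staleness.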
