Synchronising data entities from different applications - database

I'm looking for some feedback on the best approach to a problem I've been tasked with. There are two systems with their own databases which store very similar business entities.
For each entity in question there needs to be a synchronization mechanism in place so that changes in one database are delivered to the other as they occur, and so that the changes are translated into the destination table structure. This translation means that replication is not an option, but I don't want to start writing bespoke triggers or views etc. to keep them in sync.
Is this something which BizTalk or a similar product could handle after an initial configuration / mapping process? Also, is BizTalk potentially overkill, and are there any other methods I could employ to achieve this?
Thanks,
Brian.

It depends on the size of the "systems" (tables?) to synchronise.
EAI (Enterprise Application Integration) tools are the general-purpose answer to this: connecting two systems which can't interact directly, effectively mapping one business object to another, applying a map to translate one into the other.
But such tools (webMethods, for example) are enterprise products; if you only need to synchronise two tables from two systems, EAI will clearly be overkill.
Anyway, the principles can help you. The EAI approach would be to have a generic business object that matches all of the properties found in both systems for the business objects you want to synchronise. Then you need some sort of map to translate each application-specific business object to and from your generic business object. Your object should describe not only the business data, but also the operation to perform (create, update, delete).
Then you need a trigger (or two, if you want to synchronise both ways) to detect when a change happens, and use the map to transform the data the trigger gets into the generic object (with the operation to perform at the other end).
And finally you need an "updater" that takes the specific business object and performs the right operation in the destination database (insert/update/delete).
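To make that concrete, here is a minimal C# sketch of such a generic business object plus the mapper and updater roles. It is not tied to any particular EAI product, and every type and member name is an illustrative placeholder:

```csharp
// A minimal sketch of the generic business object described above: it carries the
// operation to perform plus a property bag, one map per application translates to
// and from the canonical shape, and an "updater" applies it at the destination.
using System.Collections.Generic;

public enum SyncOperation { Create, Update, Delete }

// The canonical object: every property found in either system, plus the operation.
public class GenericCustomer
{
    public SyncOperation Operation { get; set; }
    public string SourceKey { get; set; } = "";
    public Dictionary<string, object?> Properties { get; } = new();
}

// One map per application translates its specific object to/from the canonical shape.
public interface ICustomerMap<TSpecific>
{
    GenericCustomer ToGeneric(TSpecific source, SyncOperation operation);
    TSpecific ToSpecific(GenericCustomer generic);
}

// The "updater" at the destination end applies the operation.
public interface ICustomerUpdater
{
    void Apply(GenericCustomer change);   // insert / update / delete in the target DB
}
```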
EAI tools provide connectors to take care of triggering the workflow and updating the database. You will still need to define the mappings in whatever way the particular EAI product requires.
EAI is a lot more powerful than just synchronising two tables. Connectors come in various types and can interact with various systems (including proprietary ones), various databases, simple formats (XML, text) or specific protocols (FTP, web services, etc.).
EAI also ensures that any modification is effectively committed at the end.
Hope it helps.

SQL Server Integration Services (SSIS) could be a cheap candidate for solving the problem (it can connect to other DBs and data sources than SQL Server). SSIS is part of all SQL Server installations (with the exception of Express).

There is a nifty tool called "datariver" by the Swiss company Sowatec (where I worked a few years ago; I wasn't involved with this product though, just so you know). It's meant to flow data from sources to sinks (just like a river).
The web site is in German but the guys behind it are happy to answer any of your questions in English by mail.

BizTalk would be an ideal solution for this kind of problem.
What can BizTalk do?
1. Define a schema which represents a common business entity; this is essentially all the fields which need to be kept in sync across the various database tables.
2. Define the flow of communication (orchestrations) and end-points (web services), i.e. which update triggers what changes.
3. Use maps to map the common business entity into the specific data elements required by each database. Note that BizTalk has built-in adapters to speed up the development process.
If adequate time is spent on the design of this system, the results will be excellent.
For development purposes, refer to my articles (Google keywords: BizTalk + Karamchetti).

Is it possible to produce Kafka messages containing serialized generic (unknown) C# objects and deserialize them by inferring type in the consumer?

EDIT: From the initial comments, it seems I may be barking up the wrong tree. I initially suggested using replication, but according to the networking team that is not possible because the two databases are on different virtual networks. It doesn't seem like this sort of thing would ever be "impossible", but I don't have the in-depth knowledge required to "fight back". Is there anything specific I should look into regarding SQL replication over the internet/different virtual networks, or points that I could bring up?
I've got an application that uses EF to save changes to an MS SQL database. For reasons, I need to be able to capture all changes made through the application and commit those exact DB changes to another, remote MS SQL db, using Kafka as the communication hub. The solution contains a ton of data entities, so ideally I'm looking for a generic way to capture these changes from the application and apply them to the remote DB where appropriate. The consuming application that is reading the messages from Kafka will also be applying the changes via EF (at least for my purposes).
So, is it possible to do something like this without having to strongly type everything for my hundreds of entities? What are the best practices for doing such a thing? Do I produce messages containing the new entity and the change state (added, updated, deleted)? I'm sure it's a bad idea for some reason, but what about serializing the entire EF ChangeTracker and sending that, consolidating all of the adds/updates/deletes and allowing us to make only a single call instead of a bunch?
If I need to provide any more information please just let me know, I'll be monitoring this thread closely.
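For what it's worth, here is a minimal sketch of the "entity plus change state" idea, assuming EF Core: the ChangeTracker is inspected before SaveChanges and each pending change becomes a message. The EntityChangeMessage record and the JSON serialization choice are illustrative, not part of the original application:

```csharp
// Minimal sketch: capture pending entity changes generically from EF Core's
// ChangeTracker so each one can be published (e.g. to Kafka) as
// (entity type name, state, serialized payload). Cycles in navigation
// properties would need handling in a real implementation.
using System.Collections.Generic;
using System.Text.Json;
using Microsoft.EntityFrameworkCore;

public record EntityChangeMessage(string EntityType, string State, string Payload);

public static class ChangeCapture
{
    public static List<EntityChangeMessage> Capture(DbContext context)
    {
        var messages = new List<EntityChangeMessage>();

        foreach (var entry in context.ChangeTracker.Entries())
        {
            // Only ship actual modifications.
            if (entry.State is EntityState.Added or EntityState.Modified or EntityState.Deleted)
            {
                messages.Add(new EntityChangeMessage(
                    entry.Metadata.ClrType.FullName!,   // runtime entity type, no strong typing needed
                    entry.State.ToString(),             // Added / Modified / Deleted
                    JsonSerializer.Serialize(entry.Entity, entry.Metadata.ClrType)));
            }
        }

        return messages;
    }
}
```

On the consumer side, the type name and state would be enough to resolve the CLR type, deserialize the payload, and attach it to the destination DbContext with the matching EntityState.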

AnyLogic 7. Database modelling

I am looking for the best way to model a database system.
It should be made of nodes, edges and data query flows.
I know there is a flow library, but I'm not sure it is usable for such things.
So, the question is: are there any libraries I could use for this purpose, or should I mostly use my own types, agents, etc.?
The fluid library (if you meant that) is not useful for that purpose.
If you want to model the flow of data through a system of nodes, you might want to start with a simple process-modelling approach where data items are agents flowing through queues, delays and service objects...
However, depending on what your database system is doing (I am no expert there), you might actually need to switch to a pure agent-based approach sooner or later (i.e. replace the process library objects with your own functionality).
In short: start with process modelling and introduce agent-based functionality over time...
If you are new to AnyLogic I suggest you follow the logic in the tutorial for agent based modeling. Look at it as if the distributor is your server, the retailers your clients and the orders your queries. You can use GIS maps if you are concerned about the real location of servers and clients or use other network capabilities (or agent connections) if the actual locations are not important in your model.

Using DTO Pattern to synchronize two schemas?

I need to synchronize two databases.
Those databases store the same semantic objects, but they are physically different across the two databases.
I plan to use the DTO pattern to provide a uniform object representation:
DB ----> DTO ----> MAPPING (Getters / Setters) ----> DTO ----> DB
I think it's a better idea than synchronising physically with SQL queries on each side; I use Hibernate to add abstraction and synchronise the objects.
Do you think it's a good idea?
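For illustration only, here is a minimal sketch of that flow (shown in C# rather than the poster's Hibernate/Java stack; all class names are made up):

```csharp
// Sketch of the DB -> DTO -> mapping -> DTO -> DB idea: each schema gets its own
// DTO shape, and the mapping layer is the only place that knows about both.
public class CustomerDtoA            // shape read from schema A
{
    public int Id { get; set; }
    public string FullName { get; set; } = "";
}

public class CustomerDtoB            // shape expected by schema B
{
    public int CustomerId { get; set; }
    public string FirstName { get; set; } = "";
    public string LastName { get; set; } = "";
}

public static class CustomerMapper
{
    public static CustomerDtoB ToB(CustomerDtoA a)
    {
        // Example of a structural difference handled in the mapping layer.
        var parts = a.FullName.Split(new[] { ' ' }, 2);
        return new CustomerDtoB
        {
            CustomerId = a.Id,
            FirstName = parts[0],
            LastName = parts.Length > 1 ? parts[1] : ""
        };
    }
}
```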
My two cents: you need to consider using the right tool for the job. While it is tempting to write custom code to solve this problem, there are numerous tools out there that already do this for you: they map source to target, do custom transformations from attribute to attribute, and will more than likely deliver with a faster time to market.
Look to ETL tools. I'm unfamiliar with the tools available in the open source community, but if you lean in that direction I'm sure you'll find some. Other tools you might look at are Informatica, Data Integrator and SQL Server Integration Services; and if you're dealing with spatial data, there's another called Alteryx.
Tim
Doing that with an ORM might be slower by an order of magnitude than a well-crafted SQL script. It depends on the size of the DB.
EDIT
I would add that the decision should depend on the amount of difference between the two schemas, not on your expertise with SQL. SQL is so common that developers should be able to write a simple script in a clean way.
SQL also has the advantage that everybody knows how to run a script, but not everybody will know how to run your custom tool (this is a problem I have encountered in practice when the migration is actually operated by somebody else).
For schemas which only slightly differ (e.g. names, or simple transformations of column values), I would go for a SQL script. This is probably more compact and straightforward to use and communicate.
For schemas with major differences, with data organized in different tables or complex logic to map some value from one schema to the other, a dedicated tool may make sense. Chances are that the initial effort to write the tool is greater, but it can be an asset once created.
You should also consider non-functional aspects, such as exception handling, logging of errors, splitting the work into smaller transactions (because there is too much data), etc.
A SQL script can indeed become "messy" under such conditions. If you have such constraints, SQL will require advanced skills and tends to be hard to use and maintain.
The custom tool can evolve into a mini-ETL with the ability to chunk the work into small transactions, manage and log errors nicely, etc. This is more work, and can turn into a dedicated project in its own right.
The decision is yours.
I have done that before, and I thought it was a pretty solid and straightforward way to map between 2 DBs. The only downside is that any time either database changes, I have to update the mapping logic, but it's usually pretty simple to do.

User defined data objects - what is the best data storage strategy?

I am building a system that allows front-end users to define their own business objects. Defining a business object involves creating data fields for that business object and then relating it to other business objects in the system - fairly straightforward stuff. My question is, what is the most efficient storage strategy?
The requirements are:
Must support business objects with potentially 100+ fields (of all common data types)
The system will eventually support hundreds of thousands of business object instances
Business objects sometimes display data and aggregates from their relationships with other business objects
Users must be able to search for business objects by their data fields (and fields from related business objects)
The two possible solutions I can envisage are:
Have a dynamic schema such that when a new business object type is created a new table is created for storing instances of that object. The object's fields become columns in the storage table.
Have a fixed schema where instance data fields are stored as rows in basically a big long table.
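For illustration only, here is a minimal sketch of the second option (a fixed, entity-attribute-value style schema), expressed as C# classes that an ORM could map to four tables; every name is a placeholder:

```csharp
// Fixed-schema (entity-attribute-value) sketch: object types and fields are data,
// and every instance value becomes a row in FieldValue (the "big long table").
using System.Collections.Generic;

public class ObjectType                 // e.g. "Invoice", defined by the end user
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public List<FieldDefinition> Fields { get; } = new();
}

public class FieldDefinition            // e.g. "DueDate", DataType = "DateTime"
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public string DataType { get; set; } = "";
}

public class ObjectInstance             // one business object instance
{
    public long Id { get; set; }
    public int ObjectTypeId { get; set; }
    public List<FieldValue> Values { get; } = new();
}

public class FieldValue                 // one row per field per instance
{
    public long ObjectInstanceId { get; set; }
    public int FieldDefinitionId { get; set; }
    public string? Value { get; set; }  // stored as text, or as typed columns per data type
}
```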
I can see pros and cons to both approaches:
the dynamic schema allows me to index search columns
the dynamic tables are potentially limited in width by the maximum number of columns (and row size) the database allows
dynamic schemas rule out / cause issues with replication
the static schema means less or even no dynamic sql generation
my guess is the static schema may perform like a dog when it comes to searching across 100,000+ objects
So what is the best solution? Is there another approach I haven't thought of?
Edit: The requirement I have been given is to build a generic system capable of supporting front-end user defined business objects. There will of course be restrictions on how these objects can be constructed and related, but the requirement itself is not up for negotiation.
My client is a service provider and requires a degree of flexibility in servicing their own clients, hence the need to create business objects.
I think your problem matches very well to a graph database like Neo4j, as it's built for the requested kind of flexibility from the beginning. It stores data as nodes and relationships/edges, and both nodes and relationships can hold arbitrary properties (in a key/value fashion). One important difference to a RDBMS is that a graph database won't need to lookup the relationships in a big long table (like in your fixed schema solution), so there should be a significant performance gain there. You can find out about language bindings for Neo4j in the wiki and read what others say about it in this stackoverflow thread. Disclaimer: I'm part of the Neo4j team.
Without much understanding of your situation...
Instead of writing a general purpose one-size-fits-all business objects system (which is the holy grail for Oracle, Microsoft, SAS, etc.), why not do it the typical way, where the requirements are gathered, and a developer designs and implements the users' business objects in an effective manner?
If your users are typical, they will create a monster, which will end up running slow, and they will hate it. Most users will view the data as an Excel sheet, and not understand relationships like: parent/child. As a result there will be some crazy objects built, and impossible-to-solve reports. You'll be forced to create scripts to manually convert many old objects to better and properly defined ones, etc...
Your requirements sound a little bit like an associative database with a front end to compose and edit entities.
I agree with KM above, unless you have a very compelling reason not to, you would be better off using a traditional approach. There are a lot of development tools and practices that allow you to build a robust and scalable system. Otherwise you will have to implement much of this yourself.
I don't know the best way to do this, because it sounds like something that has already been implemented by others. If I were asked to implement this feature, I would recommend buying a wheel instead of reinventing it.
Perhaps there are reasons you have to invent your own? If so, then you should add those reasons to the requirements you listed.
If you absolutely must be this generic, I still recommend buying a system that has been architected for this requirement. Not just the storage requirements, which are the least of the problems your customer will have; but also: how do you keep the customer from screwing up totally when given this much freedom. Some of the commercial systems already meet this challenge without going out of business because of customers messing up.
If you still need to do this on your own, then I suggest that your requirements (or perhaps those of another vendor?) must include: allow the customer to get it right, and help keep the customer from getting it wrong. You'll need some sort of UI to allow the customer to define these business objects, and the UI should validate the model that the customer builds.
I recommend a UI that works at a conceptual level. As an example, see NORMA, a Visual Studio add-in for Object-Role Modeling (the "other" ORM). Consider it as an example only, if your end users cannot afford a Visual Studio Standard license. Otherwise, you'll find that it is extensible, already produces many types of artifact (from SQL in various dialects to code), and will validate the model to see that it makes sense. End users would also be able to enter sample data that they believe should be valid, and the system will validate the data against the model.
If your customers are producing sensible (if dynamic) business objects, then the question of storage will be much simpler.
Have you thought about an XML based solution? The requirements suggested to me "Build a system that allows users to dynamically generate an XML Schema and work with XML documents based on that schema." I don't know enough about storing and querying XML documents to comment on your original question.
Another possibility might be to leverage NHibernate's ability to generate database schemas. If you can dynamically generate business objects, then you can generate XML mappings or Fluent mappings and use that to generate a normalized database schema.
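As a rough sketch of that idea, a user-defined object definition could be turned into an NHibernate-style XML mapping document. This is heavily simplified (a real hbm.xml also needs assembly/namespace details, an id strategy, column types, and so on), and it reuses the hypothetical ObjectType/FieldDefinition classes from the earlier sketch:

```csharp
// Generate a simplified NHibernate hbm.xml mapping from a user-defined object
// definition. ObjectType/FieldDefinition are the hypothetical metadata classes
// holding what the end user defined; nothing here is a complete mapping.
using System.Linq;
using System.Xml.Linq;

public static class MappingGenerator
{
    public static XDocument ToHbmXml(ObjectType type)
    {
        XNamespace ns = "urn:nhibernate-mapping-2.2";
        return new XDocument(
            new XElement(ns + "hibernate-mapping",
                new XElement(ns + "class",
                    new XAttribute("name", type.Name),
                    new XAttribute("table", type.Name),
                    new XElement(ns + "id",
                        new XAttribute("name", "Id"),
                        new XElement(ns + "generator", new XAttribute("class", "native"))),
                    type.Fields.Select(f =>
                        new XElement(ns + "property", new XAttribute("name", f.Name))))));
    }
}
```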
Every user that I have ever talked to has always wanted "everything" in their project. Part of the job of gathering requirements is to guide the user, not just write down everything they say.
Your only hope is to build several template objects that they can add properties to; you could code your application to handle each of these object types, while still allowing the user to modify each slightly as necessary.
You need to inform the user upfront of the major flaws this type of design has. This will help you in the end, when it runs slow, or if they screw up and need help fixing something. I'd put this in writing.
How many possible objects would they really need? Perhaps you could set these up using your system first. I have developed several very customizable systems over the years, and when the user is sitting at an empty screen, they are like a deer in the headlights.
In any event, good luck.

What is the best approach for decoupled database design in terms of data sharing?

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links - setting up a few database links I can get data from A to B with the minimum of fuss. The problem for me is that you end up with a tightly-coupled design and if one database goes down it can bring the coupled databases with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much of replication, more of accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application). For such a small dataset, igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset, the option of caching locally starts to become trickier. What options would you go for in those circumstances? I wondered about an XML service, but tuinstoel (in a comment on le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db is closest, which is why I've accepted that answer, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle basic replication (as opposed to advanced replication) is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, a messaging solution seems the most appropriate Oracle approach. Oracle Advanced Queuing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be roll-your-own XML web services, to be used only where the relative ease of Advanced Queuing isn't an option.
Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...
It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.
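A conceptual sketch of that reader/queue/writer arrangement follows. The in-memory BlockingCollection stands in for a real message queue (such as AQ), and nothing here is Oracle-specific; the point is only that neither end needs the other to be up at the same time:

```csharp
// Asynchronous transfer sketch: one process reads changes from the source and
// pushes them onto a queue; another drains the queue into the sink.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public record ChangeRecord(string Table, string Key, string Payload);

public static class QueueDecouplingDemo
{
    public static async Task Main()
    {
        var queue = new BlockingCollection<ChangeRecord>(boundedCapacity: 1000);

        var reader = Task.Run(() =>
        {
            // In reality: poll the source database (or its change log) for new rows.
            for (var i = 0; i < 3; i++)
                queue.Add(new ChangeRecord("RATES", $"USD-{i}", "{ \"rate\": 1.1 }"));
            queue.CompleteAdding();
        });

        var writer = Task.Run(() =>
        {
            // In reality: apply each change to the sink database, retrying if it is down.
            foreach (var change in queue.GetConsumingEnumerable())
                Console.WriteLine($"apply {change.Table}/{change.Key}");
        });

        await Task.WhenAll(reader, writer);
    }
}
```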
The OP has provided more information. He states that the dataset is very large. Well, how large is large? And how often are the master tables changed?
With the use of materialized view logs, Oracle will only propagate the changes made to the master table; a complete refresh of the data isn't necessary. Oracle Streams also only communicates the modifications to the other side.
Buying storage is cheap, so why not cache locally? It's much cheaper than programming your own solution.
An XML service doesn't help you when its database is not available, so I don't see how it would help. Oracle has many options for replication; explore them.
edit
I've built XML services. They provide interoperability between different systems with a clear interface (contract). You can build an XML service in C# and consume it from Java. However, XML services are not fast.
Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances when it's already there? You can have propagation move messages from one instance to another when they are both up, and you can process them as needed in the destination servers. AQ is really rather simple to set up and use.
Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.
What kind of immediacy do you need and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.
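A minimal sketch of such a scheduled pull, assuming a small master table of conversion rates. It is written against the abstract ADO.NET types, so a concrete provider (Oracle, SQL Server, ...) still has to be plugged in, and all table and column names are invented:

```csharp
// Scheduled-pull sketch: copy a small master table from the source connection
// into the client database. Parameter marker syntax (@ vs :) and upsert/MERGE
// syntax vary by provider; a plain insert is shown to keep the sketch short.
using System.Data.Common;

public static class RatePuller
{
    public static void Pull(DbConnection source, DbConnection target)
    {
        source.Open();
        target.Open();

        using var select = source.CreateCommand();
        select.CommandText = "SELECT currency_code, rate FROM conversion_rates";

        using var reader = select.ExecuteReader();
        while (reader.Read())
        {
            using var upsert = target.CreateCommand();
            upsert.CommandText =
                "INSERT INTO local_rates (currency_code, rate) VALUES (@code, @rate)";

            var code = upsert.CreateParameter();
            code.ParameterName = "@code";
            code.Value = reader.GetString(0);
            upsert.Parameters.Add(code);

            var rate = upsert.CreateParameter();
            rate.ParameterName = "@rate";
            rate.Value = reader.GetDecimal(1);
            upsert.Parameters.Add(rate);

            upsert.ExecuteNonQuery();
        }
    }
}
```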
