How does ETL (database to database) fit into SOA?

Let's imagine that our application needs to ETL (extract, transform, load) data from one relational database to another relational database.
The simplest (and, IMHO, best-performing) way is to create a link between the databases and write a simple stored procedure. In this case we use minimal technologies and components, and all the features come "out of the box".
But is that good practice for SOA (service-oriented architecture)? What about tight coupling? Do we strongly couple the databases to each other forever?
There is another way to do this: we build two Java applications, one on each side, and they communicate via SOAP web services. This is more SOA friendly! But are the performance degradation and additional points of failure worth it?
What would be the best practice in this case? How does ETL fit within SOA?

In SOA, you can adopt the BizTalk or SAP BusinessObjects Data Integrator style of processing. Basically, it is a scheduled job / Windows service, or something similar. You provide two service endpoints: one for the scheduler to retrieve the data from, and another for the scheduler to send the data to. The scheduler's responsibility here is just to run periodically and transform the data.
So, the basic steps will be:
Step 1: The scheduler runs and gets the data from service A
Scheduler --get--> Service A
Service A --data--> Scheduler
Step 2: The scheduler performs the data transformation
[ Conversion --> Conversion --> Conversion --> Conversion ]
Step 3: The scheduler sends the data to the other service
Scheduler --data--> Service B
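As a rough illustration (not from the original answer), here is what such a scheduler could look like in Java, using the JDK's HttpClient and a ScheduledExecutorService; the endpoint URLs, the 15-minute interval, and the payload handling are all hypothetical placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical ETL scheduler: endpoints, schedule and transform logic are placeholders.
    public class EtlScheduler {

        private static final URI SERVICE_A = URI.create("http://service-a.example.com/data"); // assumed endpoint
        private static final URI SERVICE_B = URI.create("http://service-b.example.com/data"); // assumed endpoint

        private final HttpClient http = HttpClient.newHttpClient();

        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            // Run the ETL cycle periodically, e.g. every 15 minutes.
            scheduler.scheduleAtFixedRate(new EtlScheduler()::runOnce, 0, 15, TimeUnit.MINUTES);
        }

        void runOnce() {
            try {
                // Step 1: get the data from service A.
                HttpRequest get = HttpRequest.newBuilder(SERVICE_A).GET().build();
                String extracted = http.send(get, HttpResponse.BodyHandlers.ofString()).body();

                // Step 2: transform the data (placeholder for the real conversion chain).
                String transformed = transform(extracted);

                // Step 3: send the data to service B.
                HttpRequest post = HttpRequest.newBuilder(SERVICE_B)
                        .header("Content-Type", "application/xml")
                        .POST(HttpRequest.BodyPublishers.ofString(transformed))
                        .build();
                http.send(post, HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                // Log and keep the schedule alive; real code would also stage the batch and retry.
                e.printStackTrace();
            }
        }

        private String transform(String payload) {
            // Conversion --> Conversion --> ... : domain-specific mapping goes here.
            return payload;
        }
    }

The point of the sketch is only the shape: the scheduler owns the schedule and the transformation, while the two services stay unaware of each other.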
In both BizTalk and SAP BusinessObjects Data Integrator, the steps are configurable (they can retrieve data from whatever service and can do scripted data transformations), so it's more flexible.
However, the usual problems of ETL processing can still happen: the data is too big, network performance impact, RTOs, duplicated data, etc. So ETL best practices are still a requirement here (use of staging tables, logging, etc.).
But are the performance degradation and additional points of failure worth it?
The performance impact will happen since you now have an extra connection/authentication step (to the web service) and a transportation step (web service to scheduler via some protocol). As for being error-prone, I think these are the same errors you need to handle with any other service call.
Is it worth it? It depends. If you are working in the same environment (same database), then it's debatable. If you are working in different environments (two different systems, for example ASP.NET to SAP, or at least different database instances), then this architecture is the best bet for handling ETL.

ETL in general fits into SOA - e.g. SOA services may perform ETL operations between each other.
Database-to-database linkage is very useful when you want to replicate databases or in other similar situations. In general, this approach has nothing to do with SOA, except in the cases below.
Database-to-database linkage does not fit into SOA when both these databases are consumed by SOA services. In this case, you should communicate through services.
Database-to-database linkage still fits into SOA when only one database is the persistence for the SOA service. The other one can be considered as a failover or a simple replication, not directly related to SOA. In this case, database-to-database linkage simply becomes a data-related concern, which you are allowed to have and to solve.

For me, several points are missing in both the db-to-db and the REST-based setup:
Exceptions in the ETL process:
When is a data transformation considered valid? How is the result of an unsuccessful transformation handled? Just throwing the data away is not an option in most cases.
System failure / recovery:
What if one or both systems are down for a while? How is synchronization handled?
When did the ETL fail, and where does it have to be restarted?
So instead of having two databases or REST services communicate with each other directly, IMHO this is more a case for integration technologies such as Apache Camel, or for ESBs, which can handle the transformations, split data, process it asynchronously, put it back together, and provide proper monitoring, recovery, and load balancing for performance optimization. This will not necessarily speed up the 'E' in ETL, nor the 'L' (though it might do both), but it will certainly speed up the 'T' and has positive effects on data integrity.
And of course: ESBs are SOA-related technologies. Apache Camel, for me, is not really one, though it is considered a reference implementation of the Enterprise Integration Patterns.
Basically, the idea behind this is that ETL problems are content-based, not structure-based.
So what you could do with these techniques is something like:
DB <-- DataExtractor --> Validator --> ContentLengthBasedRouter --> Splitter (asynch)
         --> Transformer1, Transformer2, ...
         --> Aggregator --> ContentBasedRouter --> Transformer3
         --> DataInserter
         --> Monitor
and more, but that does not fit well into a textual description.
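A hedged sketch of how part of that pipeline could look as an Apache Camel route (Java DSL); the endpoint URIs, the source query and the validator/transformer/inserter beans are hypothetical, and the aggregation/content-based-routing steps are omitted for brevity:

    import org.apache.camel.builder.RouteBuilder;

    // Illustrative route only: URIs, query and bean names are placeholders, not a working configuration.
    public class EtlRoute extends RouteBuilder {
        @Override
        public void configure() {
            // Extract: poll the source database (assumed query and datasource bean).
            from("sql:select * from source_table?dataSource=#sourceDataSource")
                .to("bean:validator")                   // Validator (hypothetical bean)
                .split(body()).parallelProcessing()     // Splitter, rows processed asynchronously
                    .to("bean:transformer1")            // Transformer 1 (hypothetical bean)
                    .to("bean:transformer2")            // Transformer 2 (hypothetical bean)
                .end()
                .to("log:etl-monitor")                  // Monitor every exchange
                .to("bean:dataInserter");               // Load into the target database (hypothetical bean)
        }
    }

Camel (or an ESB built on similar ideas) then gives you the error handling, redelivery and monitoring hooks around each of these steps instead of you hand-rolling them.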

All of these answers are good and helpful.
As I now understand it, SOA is not about implementing an application but about architecture (the "A"), mainly enterprise architecture. The enterprise's main management method is delegating responsibility for services (the "S").
So if there are two different business functions in the enterprise structure, with two different responsible parties, we should divide the work into two different services with well-defined contracts (interfaces), policies, and audit methods - that is the main purpose of SOA.
But if it is an atomic function with one responsible person, there is not much need for SOA, and we should use simple technologies and implement a simple, fast, solid service application.
As for my original question, it lacks information about the task context.
Now I understand that database links should not be implemented across services; that is bad design, because it is not manageable at the enterprise level.
But within a single service it may be a good, simple solution.
Thanks to everybody who answered.

Related

keeping databases in sync (after write/update) across regions/zones

I have to write a web service in PHP to serve three different zones (cities or countries). Each zone will have its own machine running its own instance of the web service. Behind every web service is a database that is an exact clone/copy in each region, and the web service serves clients with data from that database. The main reason for multiple instances of the web service is to distribute client load.
The clients can make read and write calls via web service APIs.
Write calls will modify the database for that instance, but the change has to be applied as soon as possible to the databases in all other zones as well. Since all the databases are exact clones of each other, changes in one database must be synced to the databases in all other zones.
I presume the write calls must go to some kind of master server which coordinates among all the web services, etc. But I am sure this pattern is quite common and some solution is already out there.
Please advise whether there is any database- or application-level technique that would keep the databases in sync when there are write calls, so that modifications and additions are reflected in all database instances. I can choose the database; my primary choices would be MySQL Server or Postgres, but I can switch to another database that solves this issue.
You're right, this pattern is quite common and there is a name for it - Synchronous Master-Master replication. Most modern RDBMS support it:
PostgreSQL supports it through PgCluster: https://wiki.postgresql.org/wiki/PgCluster
MySQL: https://www.howtoforge.com/mysql_master_master_replication
But before implementing it straight away I'd recommend reading more about different types of replication, their pros and cons:
https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
https://dev.mysql.com/doc/refman/8.0/en/replication.html
Synchronous Master-Master replication will be quite slow, especially in a multi-zone scenario, so you might consider other techniques:
Asynchronous replication
Sharding/Partitioning
A mix of sharding and replication
There is a very good book on different distributed techniques(including sharding and replication) - "Designing Data Intensive Applications" by Martin Kleppmann.
Replication techniques are definitely worth looking at, but there can be a certain amount of technical overhead and cost to replication. I work for a company called Redactics (https://www.redactics.com), and we came up with a simpler solution that is a sort of near-real-time replication based on delta updates using a pure SQL approach.
There are certainly pros and cons to both approaches, and I'm not trying to push Redactics hard if it is not the most appropriate solution for your needs, but Redactics simply tracks the most recent primary keys and uses modification timestamps to find new and changed records, and then copies them over. You can run the sync quite often without much load, since it is just a delta update. Obviously any workflow can break, but repairing broken replication can be tricky, so we like this approach, and we like running these sync workflows within your own infrastructure.
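As a rough, generic illustration of the delta-update idea (this is not Redactics' actual code), a sync pass could look something like this in plain JDBC, assuming the source table has an auto-increment id and an updated_at column:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    // Rough delta-sync sketch: table/column names and the watermark handling are assumptions.
    public class DeltaSync {

        /**
         * Copies rows that are new (id > lastId) or changed (updated_at > lastSyncedAt)
         * from the source database into the target database.
         */
        public void syncOnce(Connection source, Connection target,
                             long lastId, Timestamp lastSyncedAt) throws Exception {
            String select =
                "SELECT id, payload, updated_at FROM messages " +              // hypothetical table
                "WHERE id > ? OR updated_at > ?";
            String upsert =
                "INSERT INTO messages (id, payload, updated_at) VALUES (?, ?, ?) " +
                "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload, " + // PostgreSQL upsert syntax
                "updated_at = EXCLUDED.updated_at";

            try (PreparedStatement sel = source.prepareStatement(select);
                 PreparedStatement ups = target.prepareStatement(upsert)) {
                sel.setLong(1, lastId);
                sel.setTimestamp(2, lastSyncedAt);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        ups.setLong(1, rs.getLong("id"));
                        ups.setString(2, rs.getString("payload"));
                        ups.setTimestamp(3, rs.getTimestamp("updated_at"));
                        ups.addBatch();
                    }
                }
                ups.executeBatch();
            }
            // A real job would persist the new high-water marks (max id / max updated_at) here.
        }
    }

Because each pass only moves the delta, it can run frequently, but note that this alone does not resolve conflicting writes made in two zones at once; that is where real replication still has the edge.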

Database for a java application in cluster

I'd like to play around with Kubernetes. I'm able to start a simple app, but now I'd like to design something more complex. Nevertheless, I can't figure out how to handle database access in such an architecture.
Let's say I have 100 pod replicas of some simple chat application. They all need to access the same database (or rather the same data set) and perform CRUD operations on it. How do I design this to keep the data consistent and eliminate the risk of deadlocks?
If possible, I'd like to use SQL-like database, so I can comfortably use hibernate and other tools I'm familiar with.
Is this even possible, or do I have to use a totally different approach? What is the name of the technology or architecture I'm searching for?
1) You can use a connection pool to reduce the number of database connections and make the connection settings more aggressive/elastic (see the pooling sketch after this list);
2) Split your microservices in such a way that access to the persistence layer goes through one microservice that exposes your CRUD operations on top of your persistence store (MySQL/RDBMS/NoSQL/etc.). That way you most likely don't need hundreds of replicas of your pods talking to the database directly.
3) Deadlocks / locking strategies - as Andrew mentioned in the comments, this is more related to your software architecture than to K8s itself. There are plenty of ways to deal with it, each with pros and cons.
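For point 1, a minimal HikariCP sketch of what "small and elastic" per-pod pooling could look like (the JDBC URL, credentials and pool sizes are placeholder values you would tune for your own cluster):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    // Placeholder values: URL, credentials and pool sizes must be tuned for your own setup.
    public class ChatDatabase {

        public static HikariDataSource createPool() {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:mysql://db-service:3306/chat"); // assumed in-cluster DB service name
            config.setUsername("chat_app");
            config.setPassword(System.getenv("DB_PASSWORD"));
            // Keep the per-pod connection count small: 100 pods x 5 connections is already 500.
            config.setMaximumPoolSize(5);
            config.setMinimumIdle(1);
            config.setIdleTimeout(30_000);      // drop idle connections quickly (elastic)
            config.setConnectionTimeout(2_000); // fail fast instead of queueing threads forever
            return new HikariDataSource(config);
        }
    }

Hibernate can sit on top of such a pool as usual; the point is only to bound the total number of connections the database cluster has to serve.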

Stateless Micro services and database

We have a requirement to build stateless microservices which rely on a database cluster to persist data.
What is the recommended approach for redundant stateless microservices (for high availability and scalability) using the database cluster? For example: running multiple copies of version 1.0 of the Payment service.
Should all the redundant microservices use a common shared DB schema, or should each have its own schema? In the case of independent DB schemas, inconsistency among the redundant services may exist.
Also, how can schema upgrades be handled in the case of a common DB schema?
This is a super broad topic, and rather hard to answer in general terms.
However...
A key requirement for a micro service architecture is that each service should be independent from the others. You should be able to deploy, modify, improve, scale your micro service independently from the others.
This means you do not want to share anything other than API definitions. You certainly don't want to share a schema; each service should be able to define its own schema, release new versions, change data types etc. without having to check with the other services. That's almost impossible with a shared schema.
You may not want to share a physical server. Sharing a server means you cannot make independent promises on scalability and up-time; a big part of the micro service approach means that the team that builds it is also responsible for running it. You really want to avoid the "well, it worked in dev, so if it doesn't scale on production, it's the operations team's problem" attitude. Databases - especially clustered, redundant databases - can be expensive, so you might compromise on this if you really need to.
As most microservice solutions use containerization and cloud hosting, it's quite unlikely that you'd have the "one database server to rule them all" sitting around. You may find it much better to have each micro service run its own persistence service, rather than sharing.
The common approach to dealing with inconsistencies is to accept them - but to use CQRS to distribute data between microservices, and make sure the micro services deal with their internal consistency requirements.
This also deals with the "should I upgrade my database when I release a new version?" question. If your observers understand the version for each message, they can make decisions on how to store them. For instance, if version 1.0 uses a different set of attributes to version 1.1, the listener can do the mapping.
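A minimal sketch of such a version-aware listener (the message shape, the version values and the attribute names are purely hypothetical):

    import java.util.Map;

    // Hypothetical versioned-message handling: field names are illustrative only.
    public class CustomerEventListener {

        public CustomerRecord onMessage(Map<String, Object> message) {
            String version = (String) message.getOrDefault("version", "1.0");
            switch (version) {
                case "1.0":
                    // v1.0 carries a single "name" attribute.
                    return new CustomerRecord((String) message.get("name"));
                case "1.1":
                    // v1.1 split the attribute into "firstName"/"lastName"; map it back here.
                    return new CustomerRecord(message.get("firstName") + " " + message.get("lastName"));
                default:
                    throw new IllegalArgumentException("Unsupported message version: " + version);
            }
        }

        record CustomerRecord(String displayName) { }
    }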
In the comments, you ask about consistency. This is a super complex topic - especially in micro service architectures.
If you have, for instance, a "customers" service and an "orders" service, you must make sure that all orders have a valid customer. In a monolithic application, with a single database, and exclusively synchronous interactions, that's easy to enforce at the database level.
In a micro service architecture, where you might have lots of data stores, with no dependencies on each other, and a combination of synchronous and asynchronous calls, it's really hard. This is an inevitable side effect of reducing dependencies between micro services.
The most common approach is "eventual consistency". This typically requires a slightly different application design. For instance, for the "orders" screen you would first invoke the customers microservice (to get customer data) and then the "orders" service (to get order details), rather than having a single (large) service call retrieve everything.
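In code, that screen composition might look roughly like this (the service URLs and the plain-string handling are assumptions for the sake of the sketch):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Hypothetical screen composition: URLs and response handling are placeholders.
    public class OrdersScreen {

        private final HttpClient http = HttpClient.newHttpClient();

        public String render(String customerId) throws Exception {
            // First call: the customers service owns customer data.
            String customer = get("http://customers-service/customers/" + customerId);
            // Second call: the orders service owns order data; no shared schema is needed.
            String orders = get("http://orders-service/orders?customerId=" + customerId);
            // Compose the view from both answers; each may be slightly stale (eventual consistency).
            return customer + "\n" + orders;
        }

        private String get(String url) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
        }
    }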

How to design/develop an integration layer or bus for different external services/apps

We are currently looking into replacing one of our apps, possibly with an ESB or some similar tool, and I was looking for some insight into how best to approach this.
We currently have a standalone service that consumes/interacts with different external services and data sources, some exposed through SOAP web services and others accessed through a plain DB connection. This service is exposed through SOAP, and we have other apps that consume it but are very tightly coupled to it. Now we also have other apps that need to consume some of the external services, and we would like to replace all of this with an ESB or some sort of SOA platform.
What would be the best way to replace this 'external services' integration layer with an ESB? We were thinking of having a 'global' contract/API in which all of the services we consume are exposed as one single contract, with all the operations and data structures we use living under one single namespace. Would this be the best way of approaching it? And if so, are there any tools that could help us automate this process, or do we basically have to handcraft this contract/API? This would also mean that for any change to the underlying services/APIs we would have to update this new API as well.
If not, then the other option I see is basically to use the ESB as a 'proxy' layer in which all of our sources are exposed as they are, so we would end up with several different contracts / API endpoints, but I don't really see the value in that.
Also, given the above, what would be the best tool for the job? Is a full-blown ESB overkill, or are we much better off rolling our own with something like Apache Camel or Spring Integration?
A few more details:
We are currently integrating over 5 different external services, with more to come in the future.
Only a couple of apps consume our current app at the moment, but several other apps/systems will need to consume some of these external services in the future.
We currently use a single method of communication (SOAP) between these services, but some apps might use pub/sub messaging in the future, although SOAP will still be the main protocol.
I am new to ESB integration, so I apologize in advance if I'm misunderstanding a lot of these technologies and the problems they are meant to solve.
Any help/tips/pointers will be greatly appreciated.
Thanks.
You need to put some design thought into what you want to achieve over time.
There are multiple benefits and potential pitfalls with an ESB introduction.
Here are some typical benefits/use cases
When your applications are hard to change or have very different release cycles, it's convenient to have an ESB in the middle that can adapt to changes quickly. This is very much the case when your organization buys a lot of COTS products and cloud services that might come with an update the next day that breaks the current API.
When you need to adapt data from one master data system to several other systems that might not support the same interfaces: the CRM system might want data imported via web services as soon as it's available, the ERP wants data through db/staging tables, and the production system wants data every weekend in a flat file delivered via FTP. To keep the master data system clean and easy to maintain, implement one single integration service in the master data system, and adapt that interface to the various other applications within the ESB platform instead (see the sketch after this list).
Aggregation or splitting of data from various sources to protect your sensitive systems can be another use case. Say you have an old system that can only take small updates of information at a time and it's not worth upgrading it - then an integration solution that can do aggregation, splitting, or throttling can be a good fit.
Other benefits and use cases include the ability to track and wire-tap every message passing between systems - which can even be used together with business intelligence tools to gather KPIs.
A conceptual ESB can also introduce a canonical message format that is used by all services that need to communicate. If a lot of applications share the same data with several other applications (not only point to point), then the benefits of a canonical message format can outweigh the cost (which is, or can be, high). An ESB server might be useful for dealing with canonical data, as it is usually very good at mapping from one format to another.
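To make the master-data use case above more concrete, here is a hedged Apache Camel sketch of adapting one incoming feed to three consumers; every endpoint URI, queue name and format choice here is a placeholder, not a recommendation:

    import org.apache.camel.builder.RouteBuilder;

    // Illustrative only: endpoints for the CRM/ERP/production systems are placeholders.
    public class MasterDataDistributionRoute extends RouteBuilder {
        @Override
        public void configure() {
            // One clean interface out of the master data system...
            from("jms:queue:masterdata.customer.updates")
                .multicast()
                    .to("direct:toCrm")
                    .to("direct:toErp")
                    .to("direct:toProduction")
                .end();

            // CRM wants the data pushed to its web service as soon as it is available.
            from("direct:toCrm")
                .to("cxf:bean:crmCustomerService");

            // ERP wants the data written to a staging table.
            from("direct:toErp")
                .to("sql:insert into erp_staging_customers (payload) values (:#${body})");

            // The production system wants a flat file; here it is just dropped in a spool
            // directory (FTP delivery and the weekend batching are omitted from this sketch).
            from("direct:toProduction")
                .to("file:/var/spool/production-export?fileName=customers-${date:now:yyyyMMdd}.csv");
        }
    }

The master data system only ever sees the single queue; every consumer-specific quirk lives in the integration layer, which is exactly the decoupling the ESB is supposed to buy you.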
However, introducing an ESB without a plan for which benefits you are trying to achieve is not really a good thing, since it introduces overhead - you need another server to keep alive, and you perhaps need another team to understand all the data flows. You also need specific knowledge of your integration product. Finally, you need some governance around it so that your ESB initiative does not drift away from the goals/benefits you have foreseen.
You should choose a technology that you are comfortable with - or think you can become comfortable with. Apache Camel is indeed very powerful and my favorite integration engine, but it's not an ESB, as it does not come with a runtime that you can use to deploy, manage, and monitor your integration services. You can use it together with most Java EE application servers or, even better, with Apache ServiceMix (= Karaf + Camel + ActiveMQ + CXF), which is built for this task.
The same goes for Spring Integration - you need to run it somewhere, in an app server or the like.
There is a large set of different products, both open source and commercial, that do these things.

Synchronizing intranet and web data

I am just getting started on breaking a .NET application and its SQL Server database into two systems - an intranet and a public website.
The various database tables will need to be synchronised between the two databases in different ways, for example:
Moving from web to intranet, with the intranet data becoming read-only
Moving from intranet to web, with the web data becoming read-only
Tables that need to be synchronised and are read/write on both the intranet and web databases.
Some of the synchronisation needs to occur relatively quickly, with minimal lag, possibly with some type of transaction locking to ensure repeatable reads etc. At other times it doesn't matter if there is a delay between synchronisations.
I am not quite sure where to start with all this, as there seem to be many different ways of achieving it. Which technologies and strategies should I be looking at?
Any tips?
A system like that sounds like its components are fairly tightly coupled. An upgrade across several systems all at once can turn into quite a nightmare.
It looks like this is less of a replication problem and more a problem of how to maintain a constant connection to a remote database without much I/O lag. While that can be done, it probably isn't going to work out very well in terms of scalability and troubleshooting.
You might look at using message queueing and asynchronous data processing from the remote site to the intranet. You'll probably have to adjust some expectations on the business side so that they don't assume everything is accessible in real time all the time.
Of course, it's hard to give specifics without more details. It might be a good idea to look into the principles of SOA and messaging systems for what you're trying to do.
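For the message-queueing route, a hedged sketch of the publishing side (the question's stack is .NET, but the shape of the approach is the same; this version assumes a JMS broker such as ActiveMQ, and the broker URL, queue name and payload are placeholders):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    // Assumed setup: a broker reachable from the website, and a consumer on the intranet
    // side that dequeues the changes and applies them to its own database asynchronously.
    public class ChangePublisher {

        public void publishChange(String changeAsJson) throws Exception {
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker:61616"); // placeholder URL
            Connection connection = factory.createConnection();
            try {
                connection.start();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Queue queue = session.createQueue("web.to.intranet.changes"); // placeholder queue name
                MessageProducer producer = session.createProducer(queue);
                producer.send(session.createTextMessage(changeAsJson));
            } finally {
                connection.close(); // also closes the session and the producer
            }
        }
    }

The website publishes the change and moves on; the intranet applies it when it can, which is exactly the expectation adjustment mentioned above.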
Out of the box you have SQL Server Replication. It sounds like a pair of filtered transactional replication publications could do the job. Transactional replication has a low overhead on the publisher and can ensure transactional consistency of the published changes.
Nathan raises some very valid points about the need for a more loosely coupled solution. Service Broker can fit that shoe quite well with its loosely coupled, asynchronous nature, and it provides a headache-free upgrade future, since SSB is compatible across SQL Server versions and editions. But this freedom comes at the cost of leaving the heavy lifting of actually detecting the changes and applying them to the tables to you, as application code, and that is not a trivial feat.
