Is using views a good strategy to ETL complex queries through Kafka? - sql-server

I have a data warehouse on a Microsoft SQL Server and many complex queries involving a lot of joins between tables. Each query returns a structure that is then used to populate an object in my MongoDB database.
The queries can change and involve new tables, so my strategy would be the following:
I would create some materialized views (of course, Microsoft does things its own way, so it seems those views do not exist as such but are implemented as normal views plus an index; I wonder whether that amounts to the same thing).
I would set a proper update period for the views.
Kafka would then listen for events on those views.
I'm not so sure about this approach because I don't know how, or even whether, this DBMS produces event logs for materialized views, nor whether Kafka would interpret them as changes to the tables.
The alternative would be to listen for events on every single table, but as I said there are a lot of them and they could change, so there would be a lot of maintenance involved.
What do you think?

As commented, views don't emit events.
You can use Kafka Connect JDBC to query a view just as you would any other table, though.
Otherwise, you would need separate per-table topics and would have to perform the filters and joins on the Kafka side.
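On the indexed-view question raised in the post: SQL Server's counterpart to a materialized view is an indexed view, i.e. a schema-bound view plus a unique clustered index that persists the result set and keeps it synchronously up to date. A minimal sketch, assuming hypothetical dbo.Orders and dbo.OrderItems tables whose numeric columns are NOT NULL (a requirement for SUM in an indexed view):

    -- Hypothetical base tables: dbo.Orders(OrderId, ...), dbo.OrderItems(OrderId, Quantity, UnitPrice).
    -- The view must be schema-bound and reference tables with two-part names to be indexable.
    CREATE VIEW dbo.vw_OrderTotals
    WITH SCHEMABINDING
    AS
    SELECT  o.OrderId,
            COUNT_BIG(*)                    AS ItemCount,   -- COUNT_BIG(*) is mandatory when GROUP BY is used
            SUM(oi.Quantity * oi.UnitPrice) AS OrderTotal
    FROM dbo.Orders     AS o
    JOIN dbo.OrderItems AS oi ON oi.OrderId = o.OrderId
    GROUP BY o.OrderId;
    GO

    -- The unique clustered index is what materializes the view; from here on
    -- SQL Server maintains it as part of every INSERT/UPDATE/DELETE on the base tables.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_OrderTotals ON dbo.vw_OrderTotals (OrderId);
    GO

Note that there is no update period to configure: the view is maintained synchronously, so the cost shows up as write overhead on the base tables rather than as a periodic refresh.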

Related

Performance of database views with complex queries

I have to implement some SQL views for a third-party report engine.
Each view has a complex query that joins multiple tables (some tables contain millions of rows)
Some view queries have subqueries
Views are not accessed by the application (only read by the report engine)
I have the following concerns:
We are worried this will badly impact the current application (since the complex queries in the views run every time the views are accessed).
What would be the performance impact of executing these complex views (memory, time, etc.)?
These are some other solutions we have for now.
Use new tables instead of views and update the tables using triggers and stored procedures (a rough sketch of this follows below).
Use a replicated database and create those views there (so it will not affect the current system).
Can you give me comments on the above concerns and the solutions, please? New suggestions are welcome.
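For illustration, the first alternative (a real table kept in sync by triggers) might look roughly like the sketch below; the table, column and trigger names are invented, and it only covers one source table:

    -- Hypothetical reporting table, maintained by a trigger on a hypothetical dbo.OrderItems table.
    CREATE TABLE dbo.Report_OrderTotals
    (
        OrderId    INT            NOT NULL PRIMARY KEY,
        OrderTotal DECIMAL(18, 2) NOT NULL
    );
    GO

    CREATE TRIGGER dbo.trg_OrderItems_ReportSync
    ON dbo.OrderItems
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Orders touched by the current statement (insert, update or delete).
        DECLARE @Touched TABLE (OrderId INT PRIMARY KEY);
        INSERT INTO @Touched (OrderId)
        SELECT OrderId FROM inserted
        UNION
        SELECT OrderId FROM deleted;

        -- Re-derive only the affected rows: remove them, then insert the current totals.
        DELETE r
        FROM dbo.Report_OrderTotals AS r
        JOIN @Touched AS t ON t.OrderId = r.OrderId;

        INSERT INTO dbo.Report_OrderTotals (OrderId, OrderTotal)
        SELECT oi.OrderId, SUM(oi.Quantity * oi.UnitPrice)
        FROM dbo.OrderItems AS oi
        JOIN @Touched AS t ON t.OrderId = oi.OrderId
        GROUP BY oi.OrderId;
    END;
    GO

This moves the cost onto every write to the source tables; the replicated-database option avoids that write-path cost entirely, at the price of extra infrastructure and some latency.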

SQL Server replacement for trigger with data transformation

We have following setup:
server SRV_1 has two databases: OLTP database DB_APP and reporting database DB_REP.
DB_APP has a number of triggers which do some data transformation (denormalization, using joins to other tables to build the denormalized data) and insert/update data in DB_REP
SRV_1 has transactional replication which publishes data from DB_REP
There are a few jobs on SRV_1 which regularly modify data in DB_REP based on data from DB_APP
DB_REP is not used by any application, it is only a source for replication
server SRV_2 is a reporting server which subscribes for DB_REP and makes it available to customers.
The problem is that DB_APP has a few tables which are updated very often, and the triggers create a really heavy CPU load. One of the most frequently updated tables receives about 100 updates per second.
What is the best approach to optimizing this setup? So far I have been considering the following:
try to optimize the triggers as much as possible (right now they use an IF EXISTS check to decide between INSERT and UPDATE, and I could replace that with MERGE)
create indexed views instead of triggers and replicate views from DB_APP to DB_REP
remove the triggers, remove DB_REP, and publish directly from DB_APP using custom replication stored procedures (sp_msins_xxx) with the data transformation implemented there. I am not sure whether I can query other tables in these procedures. And what should I do with the jobs?
I don't think it's a bad design to have a separate DB for replication, especially as the data is being transformed from your transactional DB.
I do however think it's bad to populate the replicated DB using triggers. We used to have a similar kind of setup, and as you're finding, performance suffers as transaction volume grows.
I don't think it's the best idea to replicate directly from the transactional DB either - replication (transactional and merge anyway) uses triggers itself, and the transformation options are pretty much limited to row and column filtering. The last thing you want to do is start modifying the replication procs or triggers!
I highly recommend moving all your update/transformation logic out of the triggers and into stored procs that can be called from a batch update job which is run periodically (e.g. overnight).
This is what we did with the majority of our triggers, and the performance gains can be huge. Our users do have to put up with the fact that some data is no longer up to the minute, but that wasn't a big problem in our case.
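As a rough illustration of that approach (all object names below are invented), the per-row trigger logic can be collapsed into one set-based MERGE inside a stored procedure and scheduled from a SQL Server Agent job:

    -- Hypothetical batch procedure replacing the trigger logic.
    -- Assumes a source DB_APP.dbo.Orders joined to DB_APP.dbo.Customers,
    -- feeding a denormalized DB_REP.dbo.OrdersDenorm table on the same server.
    CREATE PROCEDURE dbo.usp_SyncOrdersDenorm
    AS
    BEGIN
        SET NOCOUNT ON;

        MERGE DB_REP.dbo.OrdersDenorm AS tgt
        USING
        (
            -- The denormalization query that previously lived in the trigger.
            SELECT o.OrderId, c.CustomerName, o.OrderDate, o.Status
            FROM DB_APP.dbo.Orders    AS o
            JOIN DB_APP.dbo.Customers AS c ON c.CustomerId = o.CustomerId
        ) AS src
            ON tgt.OrderId = src.OrderId
        WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName
                          OR tgt.OrderDate <> src.OrderDate
                          OR tgt.Status    <> src.Status) THEN
            UPDATE SET tgt.CustomerName = src.CustomerName,
                       tgt.OrderDate    = src.OrderDate,
                       tgt.Status       = src.Status
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (OrderId, CustomerName, OrderDate, Status)
            VALUES (src.OrderId, src.CustomerName, src.OrderDate, src.Status)
        WHEN NOT MATCHED BY SOURCE THEN
            DELETE;
    END;
    GO

Run nightly (or every few minutes if the latency budget allows), this removes the trigger cost from the tables taking ~100 updates per second, in exchange for reports that lag by one batch interval.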

Replicate a filtered subset of data: Merge or Transactional replication?

First of all thanks for reading.
I need to replicate a subset of data based on a join filter, i.e. a filter based on a join with another table (Microsoft: "Using join filters, you can extend a row filter from one published table to another."). This is the setting:
SQL Server 2012;
the replication source is itself a subscription of a transactional replication;
replication needs to be a one-way sync (from publisher to subscriber);
only one subscriber/subscription;
small dataset with not many transactions;
WAN network.
What I have established so far:
Option 1 - Create views and replicate those to tables via Transactional replication.
pros: no triggers are used,
cons: objects like keys and constraints are not replicated
Option 2 - Use Merge replication with the join filter and set @subscriber_upload_options = 2 (download only); a sketch follows below.
pros: native MS functionality, all objects are replicated
cons: merge replication uses triggers, and these won't be fired by bulk loads.
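For Option 2, the two relevant pieces are a download-only merge article and the join filter itself. The sketch below is only illustrative: the publication, article names and join clause are invented, and only the parameters relevant here are shown.

    -- Download-only merge article: changes flow from publisher to subscriber only.
    EXEC sp_addmergearticle
        @publication               = N'MyMergePub',
        @article                   = N'OrderItems',
        @source_object             = N'OrderItems',
        @source_owner              = N'dbo',
        @subscriber_upload_options = 2;   -- 2 = download-only; changes are not permitted at the subscriber

    -- Join filter: extend the row filter on the Orders article to OrderItems.
    EXEC sp_addmergefilter
        @publication       = N'MyMergePub',
        @article           = N'OrderItems',
        @filtername        = N'flt_Orders_OrderItems',
        @join_articlename  = N'Orders',
        @join_filterclause = N'[Orders].[OrderId] = [OrderItems].[OrderId]',
        @join_unique_key   = 1;           -- join is on a unique key of the parent article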
The results of these two approaches are exactly the same. However, the technique differs, for example in the different agents that are used. To my understanding, Merge replication is intended mainly for server-client architectures, which is not my case, but... it works.
Because the result is the same, I am a bit in doubt about which approach to follow. I was hoping you could give me some points to consider or advise me on which approach to take.
For the setup given in this question, both Transactional and Merge replication types are good.
The only things for you to consider are:
If latency for data transfer to the Subscriber should be minimal, choose Transactional Replication.
If you require access to intermediate data states, choose Transactional Replication.
For example, if a row changes five times, transactional replication allows an application to respond to each change (such as firing a trigger), not simply the net data change to the row.
However, the type of replication you choose for an application depends on many factors.
Here are links to relevant articles on learn.microsoft.com:
"Types of Replication"
"Transactional Replication"
"Merge Replication"

Synchronizing 2 databases with different schemas

We have a normalized SQL Server 2008 database designed using generic tables. So, instead of having a separate table for each entity (e.g. Products, Orders, OrderItems, etc), we have generic tables (Entities, Instances, Relationships, Attributes, etc).
We have decided to have a separate denormalized database for quick retrieval of data. Could you please advise me of various technologies out there to synchronize these 2 databases, assuming they have different schemas?
Cheers,
Mosh
When two databases have so radically different schemas you should be looking at techniques for data migration or replication, not synchronization. SQL Server provides two technologies for this, SSIS and Replication, or you can write your own script to do this.
Replication will take new or modified data from a source database and copy it to a target database. It provides mechanisms for scheduling, packaging and distributing changes and can handle both real-time and batch updates. To work, it needs to add enough info to both databases to track modifications and match rows. In your case it would be hard to identify which "Products" have changed, as you would have to identify all relevant modified rows in 4 or more different tables. It can be done, but it will require some effort. In any case, you would have to create views that match the target schema, as replication doesn't allow any transformation of the source data.
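To make the view idea concrete, reconstituting a Product from the generic tables means pivoting attribute rows back into columns. The schema below is an invented approximation of such an EAV layout, just to show the shape of the query:

    -- Invented approximation of the generic schema:
    --   dbo.Instances(InstanceId, EntityType)
    --   dbo.Attributes(InstanceId, AttributeName, AttributeValue)
    CREATE VIEW dbo.vw_Products
    AS
    SELECT
        i.InstanceId                                                       AS ProductId,
        MAX(CASE WHEN a.AttributeName = 'Name'  THEN a.AttributeValue END) AS ProductName,
        MAX(CASE WHEN a.AttributeName = 'Price' THEN a.AttributeValue END) AS Price,
        MAX(CASE WHEN a.AttributeName = 'SKU'   THEN a.AttributeValue END) AS Sku
    FROM dbo.Instances  AS i
    JOIN dbo.Attributes AS a ON a.InstanceId = i.InstanceId
    WHERE i.EntityType = 'Product'
    GROUP BY i.InstanceId;
    GO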
SSIS will pull data from one source, transform it and push it to a target. It has no built-in mechanisms for tracking changes so you will have to add fields to your tables to track changes. It is strictly a batch process that can run according to a schedule. The main benefit is that you can perform a wide variety of transformations while replication allows almost none (apart from drawing the data from a view). You could create dataflows that modify only the relevant Product field when a Product related Attribute record changes, or simply reconstitute an entire Product record and overwrite the target record.
Finally, you can create your own triggers or stored procedures that will run when the data changes and copy it from one database to the other.
I should also point out that you have probably over-normalized your database. In all three cases you will have some performance penalty when you join all the tables to reconstitute an entity, resulting in more locking than necessary and inefficient use of indexes. You are sacrificing performance and scalability for the sake of ease of change.
Perhaps you should take a look at the Sparse Column feature of SQL Server 2008 for a way to support flexible schemas while maintaining performance and scalability.
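For completeness, a sparse-column table keeps one row per product but makes rarely-used attributes cheap to store: NULLs in SPARSE columns take no space, and an optional column set exposes them as a single XML value. The column names below are invented:

    -- Sketch of a sparse-column alternative to the generic attribute rows.
    CREATE TABLE dbo.Products
    (
        ProductId     INT IDENTITY(1, 1) PRIMARY KEY,
        ProductName   NVARCHAR(200) NOT NULL,
        Price         DECIMAL(18, 2) SPARSE NULL,          -- NULLs in SPARSE columns consume no storage
        Color         NVARCHAR(50)   SPARSE NULL,
        WeightKg      DECIMAL(10, 3) SPARSE NULL,
        AllAttributes XML COLUMN_SET FOR ALL_SPARSE_COLUMNS -- optional: all sparse columns as one XML value
    );
    GO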

On-demand refresh mode for indexed views (= materialized views) on SQL Server?

I know Oracle offers several refresh-mode options for its materialized views (on demand, on commit, periodically).
Does Microsoft SQL Server offer the same functionality for its indexed views?
If not, how else can I use indexed views on SQL Server, given that my purpose is to export data on a daily and on-demand basis and I want to avoid performance overhead problems? Does a workaround exist?
A materialized (indexed) view in SQL Server is always up to date, with the maintenance overhead added to every INSERT/UPDATE/DELETE that affects the view.
I'm not completely sure what you require; your question isn't completely clear to me. However, if you only want to pay the overhead once, on a daily and on-demand basis, I suggest that you drop the index when you don't need it and recreate it when you do. The index will be built when you create it, and the view will then be up to date. While the index is dropped there will not be any overhead on your INSERT/UPDATE/DELETE commands.
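In script form that pattern looks roughly like this (the view and index names are placeholders, and the view must have been created WITH SCHEMABINDING for the index to be allowed):

    -- Before the daily / on-demand export: materialize the view once.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_DailyExport
        ON dbo.vw_DailyExport (ExportKey);

    -- Run the export; NOEXPAND forces use of the index even on editions
    -- where the optimizer would not pick it automatically.
    SELECT *
    FROM dbo.vw_DailyExport WITH (NOEXPAND);

    -- Afterwards: drop the index so normal INSERT/UPDATE/DELETE traffic
    -- stops paying the maintenance cost.
    DROP INDEX IX_vw_DailyExport ON dbo.vw_DailyExport;

The index build itself still has to read and aggregate the base tables once per export, so this only pays off when that one-off cost is cheaper than maintaining the index continuously.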
