We have the following setup:
Server SRV_1 has two databases: an OLTP database DB_APP and a reporting database DB_REP.
DB_APP has a number of triggers that perform data transformation (denormalization, using joins to other tables to build the denormalized data) and insert/update data in DB_REP.
SRV_1 has a transactional replication publication that publishes data from DB_REP.
There are a few jobs on SRV_1 that regularly modify data in DB_REP based on data from DB_APP.
DB_REP is not used by any application; it is only a source for replication.
Server SRV_2 is a reporting server that subscribes to DB_REP and makes it available to customers.
The problem is that DB_APP has a few tables that are updated very frequently, and the triggers create a really heavy CPU load. One of the most frequently updated tables gets about 100 updates per second.
What is the best approach to optimizing this setup? So far I have been thinking about the following:
Try to optimize the triggers as much as possible (right now they use an IF EXISTS check to decide between insert and update, and I could replace that with MERGE; see the sketch after this list).
Create indexed views instead of triggers and replicate the views from DB_APP to DB_REP.
Remove the triggers, remove DB_REP, and publish directly from DB_APP using the replication stored procedures (sp_MSins_xxx) with the data transformation implemented in them. I am not sure whether I can query other tables in these procedures. And what to do with the jobs?
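For the first option, here is a minimal sketch of what the IF EXISTS logic could look like rewritten as a single MERGE. The table and column names (CustomerOrders, Orders, Customers, Amount) are placeholders, not the actual DB_APP/DB_REP schema:

    -- Inside an AFTER INSERT, UPDATE trigger on a hypothetical DB_APP.dbo.Orders table.
    -- 'inserted' holds the modified rows; one MERGE replaces the IF EXISTS / INSERT / UPDATE pair.
    MERGE DB_REP.dbo.CustomerOrders AS tgt
    USING (
        SELECT i.OrderId, i.CustomerId, c.CustomerName, i.Amount
        FROM inserted AS i
        JOIN DB_APP.dbo.Customers AS c ON c.CustomerId = i.CustomerId
    ) AS src
        ON tgt.OrderId = src.OrderId
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.Amount       = src.Amount
    WHEN NOT MATCHED THEN
        INSERT (OrderId, CustomerId, CustomerName, Amount)
        VALUES (src.OrderId, src.CustomerId, src.CustomerName, src.Amount);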
I don't think it's a bad design to have a separate DB for replication, especially as the data is being transformed from your transactional DB.
I do however think it's bad to populate the replicated DB using triggers. We used to have a similar kind of setup, and as you're finding, performance suffers as transaction volume grows.
I don't think it's the best idea to replicate directly from the transactional DB either - replication (transactional and merge, anyway) uses triggers itself, and the transformation options are pretty much limited to row and column filtering. The last thing you want to do is start modifying the replication procs or triggers!
I highly recommend moving all your update/transformation logic out of the triggers and into stored procs that can be called from a batch update job which is run periodically (e.g. overnight).
This is what we did with the majority of our triggers, and the performance gains can be huge. Our users do have to put up with the fact that some data is no longer up to the minute, but that wasn't a big problem in our case.
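As a rough illustration of the scheduling side of that approach, the transformation logic would sit in a stored procedure (dbo.usp_RefreshReportingData is an invented name) and a SQL Agent job would run it overnight:

    -- Create a nightly SQL Agent job that calls the hypothetical batch procedure.
    USE msdb;
    GO
    EXEC dbo.sp_add_job
        @job_name = N'Refresh DB_REP reporting data';

    EXEC dbo.sp_add_jobstep
        @job_name      = N'Refresh DB_REP reporting data',
        @step_name     = N'Run batch transformation',
        @subsystem     = N'TSQL',
        @database_name = N'DB_APP',
        @command       = N'EXEC dbo.usp_RefreshReportingData;';

    EXEC dbo.sp_add_jobschedule
        @job_name          = N'Refresh DB_REP reporting data',
        @name              = N'Nightly at 02:00',
        @freq_type         = 4,        -- daily
        @freq_interval     = 1,        -- every day
        @active_start_time = 20000;    -- 02:00:00 (HHMMSS)

    EXEC dbo.sp_add_jobserver
        @job_name = N'Refresh DB_REP reporting data';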
I have a data warehouse on a Microsoft SQL Server and many complex queries involving a lot of joins between tables. Each query returns a structure that will then be used to populate an object in my MongoDB database.
The queries can change and involve new tables so my strategy would be the following:
I would create some materialized views (of course Microsoft does things its own way, so it seems those views don't exist as a separate object but are implemented as a normal view plus an index; is that the same thing, I wonder? See the sketch after this list).
I would set a proper update period for the views.
Kafka would then listen for events on those views
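For reference, SQL Server's equivalent of a materialized view is an indexed view: a schema-bound view that becomes materialized once a unique clustered index is created on it, and that is then maintained synchronously rather than on a refresh schedule. A minimal sketch with invented table and column names:

    -- The view must be schema-bound and use two-part names;
    -- COUNT_BIG(*) is required when the view uses GROUP BY.
    CREATE VIEW dbo.vw_OrderTotals
    WITH SCHEMABINDING
    AS
    SELECT o.CustomerId,
           COUNT_BIG(*)  AS OrderCount,
           SUM(o.Amount) AS TotalAmount   -- assumes Amount is declared NOT NULL
    FROM dbo.Orders AS o
    GROUP BY o.CustomerId;
    GO

    -- This unique clustered index is what actually materializes the view.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_OrderTotals
        ON dbo.vw_OrderTotals (CustomerId);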
I'm not so sure about this approach because I don't know whether (or how) this DBMS produces event logs for materialized views too, nor whether Kafka would interpret them as changes to the tables.
The alternative would be to listen for events on every single table, but as I said there are a lot of them and they could change, so there would be a lot of maintenance involved.
What do you think?
As commented, views don't emit events
You can use Kafka Connect JDBC to query a view just as you would any other table, though.
Otherwise, you would need different topics and would have to perform the filters and joins in Kafka.
We have 70+ SQL Server 2008 databases that need to be copied from an OLTP environment to a separate reporting server. Once the DBs are copied, we will do some partial data transformation: de-normalization, row-level security, etc.
SSRS Reports will be written based on these static denormalized tables and views.
We have a small nightly window for copying and transforming all 70 databases (3 hours).
Currently databases average about 10GB.
Options:
1. Transactional replication:
We would need to create 100+ static denormalized tables on each reporting database.
Doing this for all 70 databases almost reaches our nightly time limit.
As the databases grow we will exceed the time limit. We thought of mixing denormalized tables with views to speed up transformation. But then there would be some dynamic and some static data which is not a solution we can use.
Also with 70 databases using transactional replication we are concerned about bandwidth usage.
2. Snapshot replication:
Copy the entire database each night.
This means we could have a mixture of denormalized tables and views so the data transformation process is quicker.
But the snapshot is a full data copy, so as the DB grows, we will exceed our time limit for completing copy and transformation.
3. Log shipping:
In our nightly window, we could use the log shipping to update the reporting databases, then truncate and repopulate the denormalized tables and use some views.
However, I understand that with log shipping, extra tables and views cannot be added to the subscribing database.
4. Mirroring:
Mirroring is being deprecated, and in any case the mirrored DB is not available for reporting until failover.
5. SQL Server 2012 AlwaysOn:
We don't have SQL Server 2012 yet; can this be configured to update once a day instead of in real time?
And can extra tables and views be created on the subscribing database (our reporting databases)?
6. Merge replication:
This is meant to be for combining multiple data sources into one database.
But it looks like it allows for a scheduled update (once per day) and only updates the subscriber DB with the latest changes rather than doing an entire snapshot.
It requires adding a rowversion column to every table, but we could handle this. Also, with this solution, could additional tables be created on the subscriber database without the update getting out of sync?
The final option is to use SSIS to select only the data we need from the OLTP databases. I think this option creates more risk, as we would have to handle inserts/updates/deletes to our denormalized tables, rather than just dropping and recreating the denormalized tables daily.
Any help on our options would be greatly appreciated.
If I've made any incorrect assumptions, please say.
If it were me, I'd go with transactional replication that runs continuously and have views (possibly indexed) at the subscriber. This has the advantage of not having to wait for the data to come over since it's always coming over.
We have a normalized SQL Server 2008 database designed using generic tables. So, instead of having a separate table for each entity (e.g. Products, Orders, OrderItems, etc), we have generic tables (Entities, Instances, Relationships, Attributes, etc).
We have decided to have a separate denormalized database for quick retrieval of data. Could you please advise me of various technologies out there to synchronize these 2 databases, assuming they have different schemas?
Cheers,
Mosh
When two databases have such radically different schemas you should be looking at techniques for data migration or replication, not synchronization. SQL Server provides two technologies for this, SSIS and Replication, or you can write your own script to do it.
Replication will take new or modified data from a source database and copy it to a target database. It provides mechanisms for scheduling, packaging and distributing changes and can handle both real-time and batch updates. To work, it needs to add enough information to both databases to track modifications and match rows. In your case it would be hard to identify which "Products" have changed, as you would have to identify all relevant modified rows in 4 or more different tables. It can be done, but it will require some effort. In any case, you would have to create views that match the target schema, as replication doesn't allow any transformation of the source data.
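To illustrate that last point, the publication would be defined over views that already present the target shape. The column and attribute names below are invented, since the actual generic schema isn't shown:

    -- Hypothetical pivot of the generic (EAV-style) tables into a Products-shaped view.
    -- Assumes Attributes holds one row per (InstanceId, AttributeName, Value).
    CREATE VIEW dbo.vw_Products
    AS
    SELECT i.InstanceId AS ProductId,
           MAX(CASE WHEN a.AttributeName = 'Name'  THEN a.Value END) AS ProductName,
           MAX(CASE WHEN a.AttributeName = 'Price' THEN a.Value END) AS Price
    FROM dbo.Instances  AS i
    JOIN dbo.Entities   AS e ON e.EntityId   = i.EntityId
                            AND e.EntityName = 'Product'
    JOIN dbo.Attributes AS a ON a.InstanceId = i.InstanceId
    GROUP BY i.InstanceId;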
SSIS will pull data from one source, transform it and push it to a target. It has no built-in mechanisms for tracking changes so you will have to add fields to your tables to track changes. It is strictly a batch process that can run according to a schedule. The main benefit is that you can perform a wide variety of transformations while replication allows almost none (apart from drawing the data from a view). You could create dataflows that modify only the relevant Product field when a Product related Attribute record changes, or simply reconstitute an entire Product record and overwrite the target record.
Finally, you can create your own triggers or stored procedures that will run when the data changes and copy it from one database to the other.
I should also point out that you have probably over-normalized your database. In all three cases you will have some performance penalty when you join all the tables to reconstitute an entity, resulting in more locking than necessary and inefficient use of indexes. You are sacrificing performance and scalability for the sake of ease of change.
Perhaps you should take a look at the Sparse Column feature of SQL Server 2008 for a way to support flexible schemas while maintaining performance and scalability.
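As a minimal sketch of that suggestion (all column names invented), sparse columns let you keep a wide, conventional table where most attributes are NULL for most rows:

    -- SPARSE columns take no space when NULL, at the cost of slightly more
    -- expensive storage for non-NULL values, so they suit mostly-empty attributes.
    CREATE TABLE dbo.Products
    (
        ProductId   INT IDENTITY(1,1) PRIMARY KEY,
        Name        NVARCHAR(200) NOT NULL,
        Colour      NVARCHAR(50)  SPARSE NULL,
        WeightGrams INT           SPARSE NULL,
        ExpiryDate  DATE          SPARSE NULL
    );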
I have a very large (100+ GB) SQL Server 2005 database that receives a large number of inserts and updates, with less frequent selects. The selects require a lot of indexes to keep them performing well, but it appears the number of indexes is affecting the efficiency of the inserts and updates.
Question: Is there a method for keeping two copies of a database where one is used for the inserts and updates while the second is used for the selects? The second copy wouldn't need to be real-time updated, but shouldn't be more than an hour old. Is it possible to do this kind of replication while keeping different indexes on each database copy? Perhaps you have other solutions?
You're looking to set up a master/child database topology using replication. With SQL Server you'll need to set up replication between two databases (preferably on separate hardware). The master DB you should use for inserts and updates. The child will service all your select queries. You'll also want to optimize both databases' configuration settings for the type of work they will be performing. If you have heavy select queries on the child database you may also want to set up views that will make the queries perform better than complex joins on tables.
Some reference material on replication:
http://technet.microsoft.com/en-us/library/ms151198.aspx
Just google it and you'll find plenty of information on how to set it up and configure it:
http://search.aim.com/search/search?&query=sql+server+2005+replication&invocationType=tb50fftrab
Transactional replication can do this, as the subscriber can have a number of additional indexes compared with the publisher. But you have to bear in mind a simple fact: all inserts/updates/deletes are going to be applied at the reporting copy (the subscriber), and the additional indexes will... slow down replication. It is actually possible to slow replication down to a rate at which it is unable to keep up, causing the distribution DB to swell. But this only happens when you have a constant high rate of updates. If the problems only occur during spikes, then the distribution DB will act as a queue that absorbs the spikes and levels them off during off-peak hours.
I would not take on this endeavour without absolute, 100% proof that it is the additional indexes that are slowing down the inserts/updates/deletes, and without testing that the inserts/updates/deletes actually perform significantly better without the extra indexes. Specifically, ensure that the culprit is not the other usual suspect: lock contention.
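For illustration, the reporting-only indexes would be created directly on the subscriber copy of the table (names invented); they don't exist at the publisher, so only the replicated writes pay for maintaining them:

    -- Run on the subscriber database only; the publisher keeps its leaner index set.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_OrderDate
        ON dbo.Orders (CustomerId, OrderDate)
        INCLUDE (Amount);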
Generally, all set-based operations (including updating indexes) are faster than non set-based ones
1,000 inserts will most probably be slower than one insert of 1,000 records.
You can batch the updates to the second database. This will, first, make the index updating faster and, second, smooth out the peaks.
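A rough sketch of the batching idea, with a hypothetical staging table and made-up names:

    -- Accumulate changes in a staging table, then apply them to the reporting copy
    -- in one set-based statement instead of thousands of single-row writes.
    INSERT INTO Reporting.dbo.Orders (OrderId, CustomerId, Amount, OrderDate)
    SELECT s.OrderId, s.CustomerId, s.Amount, s.OrderDate
    FROM dbo.Orders_Staging AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM Reporting.dbo.Orders AS o
                      WHERE o.OrderId = s.OrderId);

    TRUNCATE TABLE dbo.Orders_Staging;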
You could use a scheduled task to run a bcp script that copies the data to the other DB.
You could also try transaction log shipping to update the read only db.
Don't forget to adjust the fill factor when you create your two databases. It should be low(er) on the database with frequent updates, and 100 on your "data warehouse"/read only database.
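For example (index names and values are only illustrative):

    -- On the write-heavy copy: leave free space in leaf pages to reduce page splits.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
        ON dbo.Orders (CustomerId)
        WITH (FILLFACTOR = 70);

    -- On the read-only reporting copy: pack the pages completely.
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
        REBUILD WITH (FILLFACTOR = 100);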
We have a database on SQL Server 2000 which should be truncated from time to time. It looks like the easiest solution would be to create a duplicate database and copy the primary database there. Then the primary database may be safely truncated by specially tailored stored procedures.
One-way replication would guarantee that the backup database contains all updates from the primary one.
We plan to use the backup database for reporting and the primary for operational data.
The primary database will be truncated at night once every 2 days.
The database is several gigabytes. Only a few tables are quite large (1-2 million rows).
What are possible pitfalls? How reliable would such a solution be? Will it slow down the primary database?
Update: The variant with DTS doing the copying sounds good but has its own disadvantages. It requires a fairly robust script which would run for about an hour to copy the updated rows. There is also an issue with integrity constraints in the primary database which would make truncating it a non-trivial task. Because of this, replication could straighten things up considerably.
It is also possible, but not a particularly good variant, to use a UNION view, because the system works mostly in unattended mode without dedicated support personnel. That is a related issue, though not a technical one.
While replication is usually robust, there are times where it can break and require a refresh. Managing and maintaining replication can become complicated. Once the primary database is truncated, you'll have to make sure that action is not replicated. You may also need an improved system of row identification as after you've truncated the primary database tables a couple of times, you'll still have a complete history in your secondary database.
There is a performance hit on the publisher (primary) as extra threads have to run to read the transaction log. Unless you're under heavy load at the moment, you likely won't notice this effect. Transaction log management can become more important also.
Instead, I'd look at a different solution for your problem. For example, before truncating, you can take a backup of the database, and restore it as a new database name. You then have a copy of the database as it was before the truncation, and you can query both at once using three-part names.
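A sketch of that approach; the file paths and the logical file names (PrimaryDB, PrimaryDB_log) are placeholders that you would check with RESTORE FILELISTONLY:

    -- Archive the current state under a new name before truncating.
    BACKUP DATABASE PrimaryDB
        TO DISK = N'D:\Backups\PrimaryDB_pre_truncate.bak'
        WITH INIT;

    RESTORE DATABASE SecondaryDBJune2008
        FROM DISK = N'D:\Backups\PrimaryDB_pre_truncate.bak'
        WITH MOVE 'PrimaryDB'     TO N'D:\Data\SecondaryDBJune2008.mdf',
             MOVE 'PrimaryDB_log' TO N'D:\Data\SecondaryDBJune2008_log.ldf';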
You've mentioned that the purpose of the secondary data is to report off. In this case you can create a view like SELECT * FROM Primary.dbo.Table UNION ALL SELECT * FROM SecondaryDBJune2008.dbo.Table UNION ALL SELECT * FROM SecondaryDBOctober2008.dbo.Table. You would then need to keep this view up to date whenever you perform a truncate.
The other alternative would be to take a snapshot of the current data before truncation and insert it into a single reporting database. Then you'd just have the Primary and the Historical databases - no need to modify views once they're created.
How much data are we talking about in GB?
As you're planning to perform the truncation once every two days, I'd recommend the second alternative, snapshotting the data before truncation into a single Historical database. This can easily be done with a SQL Agent job, without having to worry about replication keeping the two sets of data in sync.
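A minimal sketch of what the job step might run just before the truncation (table and column names, including SnapshotDate, are invented for the illustration):

    -- Copy the rows that are about to be truncated into the Historical database,
    -- stamping them with the snapshot date.
    INSERT INTO Historical.dbo.Orders (OrderId, CustomerId, Amount, SnapshotDate)
    SELECT o.OrderId, o.CustomerId, o.Amount, GETDATE()
    FROM [Primary].dbo.Orders AS o;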
I would not use replication for this. We have a fairly complex replication setup running with 80+ branches replicating a few tables to one central database. When connectivity goes down for a few days, the data management issues are hair raising.
If you want to archive older data, rather use DTS. You can then build the copying and truncation/deletion of data into the same DTS package, setting it so that the deletion only happens if the copy was successful.