Syncrhonizing 2 database with different schemas - sql-server

We have a normalized SQL Server 2008 database designed using generic tables. So, instead of having a separate table for each entity (e.g. Products, Orders, OrderItems, etc), we have generic tables (Entities, Instances, Relationships, Attributes, etc).
We have decided to have a separate denormalized database for quick retrieval of data. Could you please advise me of various technologies out there to synchronize these 2 databases, assuming they have different schemas?
Cheers,
Mosh

When two databases have so radically different schemas you should be looking at techniques for data migration or replication, not synchronization. SQL Server provides two technologies for this, SSIS and Replication, or you can write your own script to do this.
Replication will take new or modified data from a source database and copy it to a target database. It provides mechanisms for scheduling, packaging and distributing changes and can handle both real-time as well as batch updates. To work it needs to add enough info in both databases to track modifications and matching rows. In your case it would be hard to identify which "Products" have changed as you would have to identify all relevant modified rows in 4 or more different tables. It can be done but it will require some effort. In any case, you would have to create views that match the target schema, as replication doesn't allow any transformation of the source data.
SSIS will pull data from one source, transform it and push it to a target. It has no built-in mechanisms for tracking changes so you will have to add fields to your tables to track changes. It is strictly a batch process that can run according to a schedule. The main benefit is that you can perform a wide variety of transformations while replication allows almost none (apart from drawing the data from a view). You could create dataflows that modify only the relevant Product field when a Product related Attribute record changes, or simply reconstitute an entire Product record and overwrite the target record.
Finally, you can create your own triggers or stored procedures that will run when the data changes and copy it from one database to the other.
I should also point out that you have probably over-normalized your database. In all three cases you will have some performance penalty when you join all tables to reconstitute an entity, resulting in a larger amount of locking that is necessary and inefficient use of indexes. You are sacrificing performance and scalability for the sake of ease of change.
Perhaps you should take a look at the Sparse Column feature of SQL Server 2008 for a way to support flexible schemas while maintaining performance and scalability.

Related

Snowflake as a Backend database for a Frontend UI

We have a requirement wherein we want our business users to manipulate data in Snowflake through some UI interface (which might require creating additional reference tables etc)
Is it a good practice as Snowflake is for DW purpose and not for transactional data and are there any performance issues in doing so?
Typical actions could be filtering data, searching for particular IDs, updating/deleting certain row(s), etc. For these activities wanted to know if it's cost effective?
Yeah your point is correct, Snowflake db will suit DW workload better than the transactional one.
Having said that, we must also note that the Snowflake databases support a lot of features provided by traditional OLTP databases. For example, honouring ACID properties, transactional consistency, object recovery from accidental drops, read-only database copies for reporting purposes, data encryption, Role based access control, Secure Views, Materialized Views, support for semi-structured data, supportability of a wide variety of function such as Scalar Functions, Aggregation Functions, Window Functions, Table Functions, System Functions as well as support for External Function and customized User-defined Functions (UDFs) etc.
It also provides connectivity interface (driver/connectors) to connect to a variety of different database systems, big-data eco-systems as well as analytical tools.
The Snowflake engine also offers an amazing level of query performance and it has the ability to add dynamic compute power for higher concurrency and greater performance.
So if you are looking for a database for good query performance and doing operations such as querying data, filtering data and routine updates and deletes then it will serve the purpose.
But if you planning to create constraints on tables, PK/FK relationships between tables, indexes, a lot of single row Inserts, any isolation level apart from Read Committed, transaction management through a stored procedure etc. then it may not be a natural choice.
There is a concept of table clustering in place of Indexes. Single row Inserts must be converted into 'COPY Into Table' commands to reduce throttling and to get better performance. Primary Keys / Foreign Keys can be created but they are not enforced.

Analytics along with OLTP Database

I have a primary use case where I want to have a transactional relational database for which I am using Postgres.
I also need to run frequent aggregate queries (count, sum, average) on the data. These statistics cannot be precomputed as there are multiple filters for search that we have to provide.
I was initially thinking of using Redshift as a secondary storage, which can serve these queries, but then I would also need to build a system to keep the data in sync between the two storages.
Is there a better way to achieve this?
Take a look at AWS DMS, you can set this up to keep a near real time replica of your Postgres data on Redshift.
It is reliable and requires minimal maintenance (e.g. if you add new columns to your source data).
Read both of these carefully, especially limitations and requirements.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
and
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html
Unless you need them, I recommend excluding text (and other large object) columns from the sync. this can be done easily by setting a flag, or can be tailored column by column.
The source Postgres database does not have to be held on AWS.

How to copy entire SQL Server 2008 database, applying WHERE clause to restrict copied data

To allow more realistic conditions during development and testing, we want to automate a process to copy our SQL Server 2008 databases down from production to developer workstations. Because these databases range in size from several GB up to 1-2 TB, it will take forever and not fit onto some machines (I'm talking to you, SSDs). I want to be able to press a button or run a script that can clone a database - structure and data - except be able to specify WHERE clauses during the data copy to reduce the size of the database.
I've found several partial solutions but nothing that is able to copy schema objects and a custom restricted data without requiring lots of manual labor to ensure objects/data are copied in correct order to satisfy dependencies, FK constraints, etc. I fully expect to write the WHERE clause for each table manually, but am hoping the rest can be automated so we can use this easily, quickly, and frequently. Bonus points if it automatically picks up new database objects as they are added.
Any help is greatly appreciated.
Snapshot replication with conditions on tables. That way you will get your schema and data replicated whenever needed.
This article describes how to create a merge replication, but when you choose snapshot replication the steps are the same. And the most interesting part is Step 8: Filter Table Rows. of course, because with this you can filter out all the unnecessary data to get replicated. But this step needs to be done for every entity and if you've got like hundreds of them, then you'd better analyze how to do that programmatically instead of going through the wizard windows.

SQL Server - Production DB Schema vs. Reporting DB Schema. Should they be the same?

We recently put a new production database into use. The schema of this database is optimized for OLTP. We're also getting ready to implement a reporting server to be used for reporting purposes. I'm not convinced we should just blindly use the same schema for our reporting database as we do for our production database, and replicate data over.
For those of you that have dealt with having separate production and reporting databases, have you chosen to use the same database schema for your reporting database, or a schema that is more efficient for reporting; for example, perhaps something more denormalized?
Thanks for thoughts on this.
There's really two sides to the story:
if you keep the schema identical, then updating the reporting database from the production is a simple copy (or MERGE in SQL Server 2008) command. On the other hand, the reports might get a bit harder to write, and might not perform optimally
if you devise a separate reporting schema, you can optimize it for reporting needs - then the creation of new reports might be easier and faster, and the reports should perform better. BUT: The updating is going to be harder
So it really boils down to: are you going to create a lot of reports? If so: I'd recommend coming up with a specific reporting schema optimized for reports.
Or is the main pain point the upgrade? If you can define and implement that once (e.g. with SQL Server Integration Services), maybe that's not really going to be a big issue after all?
Typically, chances are that you'll be creating a lot of reports of time, so there's a good chance it might be beneficial in the long run to invest a bit upfront in a separate reporting schema, and a data loading process (typically using SSIS) and then reap the benefit of having better performing reports and faster report creation time.
I think that the reporting database schema should be optimized for reporting - so you'll need a ETL Process to load your data. In my experience I was quickly at the point that the production schema does not fit my reporting needs.
If you are starting your reporting project I would suggest that you design your reporting database for your reports needs.
For serious reporting, usually you create data warehouse (Which is typically at least somewhat denormalized and certain types of calculations are done when the data is refreshed to save from averaging the values of 1.3 million records when you run the report. This is for the kind of reporting reporting that includes a lot of aggregate data.
If your reporting needs are not that great a replicated database might work. It may also depend on how up-to-date you need the data to be as data warehouses are typically updated once or twice a day so the reporting data is often one day behind, OK for monthly and quarterly reports not so good to see how many widgits have been ordered so far today.
The determinate of whether you need a data warehouse tends to be how long it would take to runthe reports they need. This is why datawarehouse pre-aggregate data on loading it. IF your reoports are running fine and you just want to get the worokload away from the input workload a replicated adatabase should do the trick. If you are trying to do math on all the records for the last ten years, you need a data warehouse.
You could do this in steps too. Do the replication now, to get reporting away from data input. That should be an immediate improvement (even if not as much as you want), then design and implement the datawarehouse (which can be a fairly long and involved project and which will take some time to get right).
It's easiest just to copy over.
You could add some views to that schema to simplify queries - to conceptually denormalize.
If you want to go the full Data Warehouse/Analysis Services route, it will be quite a bit of work. But it's very fast, takes up less space, and users seem to like it. If you're concerned about large amounts of data and response times, you should look into this.
If you have many many tables being joined, you might look into actually denormalizing the data. I'd do a test case just to see how much gain for pain you'll be getting.
Without going directly for the data warehouse solution you could always put together some views that rearrange data for better reporting access. This helps you in that you don't have to start a large warehouse project right away and could help scope out a warehouse project if you decide to go that way.
All the answers I've read here are good, I would just add that you do this in stages, stopping as soon as your goals for performance and functionality are met:
Keep the schema identical - this just takes contention and load off the OLTP server
Keep the schema identical - but add new indexed views OR index base tables differently
Build a partial data-warehouse style model (perhaps not keeping snapshot-style history or slowly changing dimensions or anything special not catered for in your normal database) from the copy-schema in another schema or database on the same reporting server. The benefits of star-schema models are huge for reporting, views flattened for users and data dictionaries etc. In this model, if your OLTP database loses changes (for instance customer name changes) due to overwrites, the data warehouse doesn't capture that information (often it's not that important if you stop at this spot). Effectively you are getting data warehouse-style organization for "current" data only. The benefits of retaining the copy of the original schema on your reporting server at this point are that you can pull from the source data in original SQL Server form instead of some kind of intermediate form (like text files) without affecting production OLTP, and you can migrate data models gradually, some in stars, some in normal form, all without affecting production. At some point later, you might be able to drop all or part of the copy.
Build a full data-warehouse including slowly changing dimensions where all the data is captured from the source system.

Transforming OLTP Relational Database to Data Warehousing Model

What are the common design approaches taken in loading data from a typical Entity-Relationship OLTP database model into a Kimball star schema Data Warehouse/Marts model?
Do you use a staging area to perform the transformation and then load into the warehouse?
How do you link data between the warehouse and the OLTP database?
Where/How do you manage the transformation process - in the database as sprocs, dts/ssis packages, or SQL from application code?
Personally, I tend to work as follows:
Design the data warehouse first. In particular, design the tables that are needed as part of the DW, ignoring any staging tables.
Design the ETL, using SSIS, but sometimes with SSIS calling stored procedures in the involved databases.
If any staging tables are required as part of the ETL, fine, but at the same time make sure they get cleaned up. A staging table used only as part of a single series of ETL steps should be truncated after those steps are completed, with or without success.
I have the SSIS packages refer to the OLTP database at least to pull data into the staging tables. Depending on the situation, they may process the OLTP tables directly into the data warehouse. All such queries are performed WITH(NOLOCK).
Document, Document, Document. Make it clear what inputs are used by each package, and where the output goes. Make sure to document the criteria by which the input are selected (last 24 hours? since last success? new identity values? all rows?)
This has worked well for me, though I admit I haven't done many of these projects, nor any really large ones.
I'm currently working on a small/mid size dataware house. We're adopting some of the concepts that Kimball puts forward, i.e. the star scheme with fact and dimension tables. We structure it so that facts only join to dimensions (not fact to fact or dimension to dimension - but this is our choice, not saying it's the way it should be done), so we flatten all dimension joins to the fact table.
We use SSIS to move the data from the production DB -> source DB -> staging DB -> reporting DB (we probably could have have used less DBs, but that's the way it's fallen).
SSIS is really nice as it's lets you structure your data flows very logically. We use a combination of SSIS components and stored procs, where one nice feature of SSIS is the ability to provide SQL commands as a transform between a source/destination data-flow. This means we can call stored procs on every row if we want, which can be useful (albeit a bit slower).
We're also using a new SQL Server 2008 feature called change data capture (CDC) which allows you to audit all changes on a table (you can specify which columns you want to look at in those tables), so we use that on the production DB to tell what has changed so we can move just those records across to the source DB for processing.
I agree with the highly rated answer but thought I'd add the following:
* Do you use a staging area to perform the transformation and then
load into the warehouse?
It depends on the type of transformation whether it will require staging. Staging offers benefits of breaking the ETL into more manageable chunks, but also provides a working area that allows manipulations to take place on the data without affecting the warehouse. It can help to have (at least) some dimension lookups in a staging area which store the keys from the OLTP system and the key of the latest dim record, to use as a lookup when loading your fact records.
The transformation happens in the ETL process itself, but it may or may not require some staging to help it along the way.
* How do you link data between the warehouse and the OLTP database?
It is useful to load the business keys (or actual primary keys if available) into the data warehouse as a reference back to the OLTP system. Also, auditing in the DW process should record the lineage of each bit of data by recording the load process that has loaded it.
* Where/How do you manage the transformation process - in the
database as sprocs, dts/ssis packages,
or SQL from application code?
This would typically be in SSIS packages, but often it is more performant to transform in the source query. Unfortunately this makes the source query quite complicated to understand and therefore maintain, so if performance is not an issue then transforming in the SSIS code is best. When you do this, this is another reason for having a staging area as then you can make more joins in the source query between different tables.
John Saunders' process explanation is a good.
If you are looking to implement a Datawarehouse project in SQL Server you will find all the information you require for the delivering the entire project within the excellent text "The Microsoft Data Warehouse Toolkit".
Funilly enough, one of the authors is Ralph Kimball :-)
You may want to take a look at Data Vault Modeling. It claims solving some loner term issues like changing attributes.

Resources