We run intra-day Data Warehouse loads throughout the day (using SSIS on SQL Server 2005).
The reporting is done through Business Objects (XI 3.1 WebI).
We are not currently facing any issues, but what are the best practices for intra-day Data Warehouse loads when reporting runs against the same database at the same time?
thanks,
Amrit
Not sure if I understood you correctly, but I guess the two main problems you may be facing are:
data availability: your users may want to query data that you have temporarily removed because you're refreshing it (this depends on your data loading approach).
performance: reporting may be affected by the data loading processes.
If your data is partitioned, I think a partition-switch-based data load would be a nice approach.
You perform the data load into a staging partition (or table) that holds the data you're reloading, while the data warehouse partition is still available with all the data for the users. Then, once you have finished loading the staging side, you can switch it into the data warehouse almost instantly. This solves the data availability problem and could help reduce the performance one (if, for instance, your staging partition is on a different drive than the data warehouse).
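For illustration, a rough T-SQL sketch of such a partition switch (all table, column and partition names here are made up; it assumes an Enterprise Edition partitioned fact table plus staging tables with identical structure, on the same filegroup, with a check constraint matching the partition boundaries):
    -- 1. Load the refreshed data into the staging table while FactSales stays fully available.
    INSERT INTO dbo.StgSales (SaleDate, CustomerKey, Amount)
    SELECT SaleDate, CustomerKey, Amount
    FROM dbo.SourceSales
    WHERE SaleDate >= '20090601' AND SaleDate < '20090701';
    -- 2. Swap the freshly loaded staging table in for the live partition (metadata-only, near-instant).
    ALTER TABLE dbo.FactSales SWITCH PARTITION 6 TO dbo.StgSalesOld;
    ALTER TABLE dbo.StgSales SWITCH TO dbo.FactSales PARTITION 6;
    -- 3. Clean out the old rows once nobody needs them any more.
    TRUNCATE TABLE dbo.StgSalesOld;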
More info on partitioned data loads and other data loading techniques here:
http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
Related
Looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and, as a result, there is a 24-hour data latency.
We are looking to build something that would allow for more real-time availability of the data, similar to SSRS, for our clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data over the weekend, and then apply writes in real time during weekdays.
Thanks.
Elasticsearch's main use cases are providing search-type capabilities on top of unstructured, large, text-based data. For example, if you were ingesting large batches of emails into your data store every day, Elasticsearch is a good tool to parse out pieces of those emails based on rules you set up, to enable searching (and to some degree querying) of those email messages.
If your data is already in SQL Server, it sounds like it's already structured, and therefore there's not much to gain from Elasticsearch in terms of reportability and availability. Rather, you'd likely be introducing extra complexity to your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look at building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features you could look into to accomplish this are AlwaysOn Availability Groups, Replication, and SSIS.
Each of the options above (in addition to other out-of-the-box features of SQL Server) has different pros and drawbacks. For example, AlwaysOn Availability Groups are very easy to set up and offer the ability to automatically fail over if your main server has an outage, but they clone the entire database to a replica. Replication lets you more granularly choose to copy only specific tables and views, but then you can't as easily fail over if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
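For example (purely illustrative - the table, column and index names are made up, and it assumes SQL Server 2016 or later), a nonclustered columnstore index on the reporting columns can make aggregate queries dramatically faster:
    -- Columnstore index covering the columns the reports aggregate over.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders_Reporting
    ON dbo.Orders (OrderDate, CustomerId, ProductId, Quantity, TotalAmount);
    -- A typical reporting aggregate that benefits from batch-mode scans of the columnstore.
    SELECT ProductId, SUM(TotalAmount) AS Revenue, COUNT(*) AS OrderCount
    FROM dbo.Orders
    WHERE OrderDate >= '20240101'
    GROUP BY ProductId;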
I've been down both pathways of implementing Elasticsearch and a data warehouse (using all three of the main data synchronization features above), for structured data and for unstructured large text data, and have experienced the proper use cases for both. One data warehouse I've managed in the past had tables with billions of rows (each table terabytes in size), and it was highly performant to report off of on fairly modest hardware in AWS (we weren't even using Redshift).
I need to merge data from an MSSQL server and a REST service on the fly. I have been asked not to store the data permanently in the MSSQL database as it changes periodically (caching would be OK, I believe, as long as the cache time was adjustable).
At the moment, I am querying for data, then pulling joined data from a memory cache. If the data is not in the cache, I call the REST service and store the result in the cache.
This can be cumbersome and slow. Are there any patterns, applications or solutions that would help me solve this problem?
My thought is that I should move the cached data to a database table, which would speed up joins, and have the application periodically refresh the data in that table. Any thoughts?
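For what it's worth, a rough sketch of what such a refresh table and join might look like (all names hypothetical, SQL Server 2008+ syntax):
    -- Cache table for the REST data, refreshed by the application on a schedule.
    CREATE TABLE dbo.RestServiceCache (
        ExternalId  VARCHAR(50)   NOT NULL PRIMARY KEY,  -- key used to join to the MSSQL data
        PayloadJson NVARCHAR(MAX) NULL,                  -- raw response, or split into typed columns
        RefreshedAt DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
    );
    -- Joins then happen inside the database instead of being stitched together in memory.
    SELECT o.OrderId, o.ExternalId, c.PayloadJson
    FROM dbo.Orders AS o
    JOIN dbo.RestServiceCache AS c ON c.ExternalId = o.ExternalId
    WHERE c.RefreshedAt > DATEADD(MINUTE, -30, SYSUTCDATETIME());  -- adjustable cache window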
You can try Denodo. It allows connecting multiple data sources and has an in-built caching feature.
http://www.denodo.com/en
I am working on a project for a high-performance dashboard where the results are mostly aggregated data mixed with non-aggregated data. The first page is loaded by 8 different complex queries returning this mixed data. The dashboard is served by a centralized database (Oracle 11g) which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, counts, group-bys and many where conditions).
The issue is that as the data grows, the DB queries are taking more time than defined/agreed. I am thinking of moving the aggregation functionality (all the counts) to a columnar database, say HBase, while the rest of the detail data would still be fetched from Oracle. Both result sets would be merged on a key in the application layer. I need an expert opinion on whether this is the correct approach.
There are a few things which are not clear to me:
1. Will Sqoop be able to load data based on a query/view, or only on tables? On a continuous basis, or only one time?
2. If a record is modified (e.g. its status is changed), how will HBase get to know?
My two cents: HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near real-time queries, you can go with an MPP database. A commercial option is Vertica, or maybe Redshift from Amazon. For an open-source solution, you can use Infobright.
These columnar options are going to give you great aggregate query performance.
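As a purely illustrative sketch (table and column names are made up), a Hive table stored as Parquet for the aggregate side might look like this:
    -- Columnar (Parquet) table holding the data the dashboard aggregates over.
    CREATE TABLE dashboard_events (
        event_date  STRING,
        system_name STRING,
        status      STRING,
        amount      DOUBLE
    )
    STORED AS PARQUET;
    -- Aggregations then scan only the referenced columns.
    SELECT event_date, status, COUNT(*) AS cnt, SUM(amount) AS total
    FROM dashboard_events
    GROUP BY event_date, status;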
What are the common design approaches taken in loading data from a typical Entity-Relationship OLTP database model into a Kimball star schema Data Warehouse/Marts model?
Do you use a staging area to perform the transformation and then load into the warehouse?
How do you link data between the warehouse and the OLTP database?
Where/How do you manage the transformation process - in the database as sprocs, dts/ssis packages, or SQL from application code?
Personally, I tend to work as follows:
Design the data warehouse first. In particular, design the tables that are needed as part of the DW, ignoring any staging tables.
Design the ETL, using SSIS, but sometimes with SSIS calling stored procedures in the involved databases.
If any staging tables are required as part of the ETL, fine, but at the same time make sure they get cleaned up. A staging table used only as part of a single series of ETL steps should be truncated after those steps are completed, with or without success.
I have the SSIS packages refer to the OLTP database at least to pull data into the staging tables. Depending on the situation, they may process the OLTP tables directly into the data warehouse. All such queries are performed WITH (NOLOCK); see the sketch after this list.
Document, document, document. Make it clear what inputs are used by each package and where the output goes. Make sure to document the criteria by which the inputs are selected (last 24 hours? since last success? new identity values? all rows?).
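A rough sketch of the staging pull and clean-up described above (object names and the selection criterion are hypothetical):
    -- Clear the staging table whether or not the previous run succeeded.
    TRUNCATE TABLE stg.Customer;
    -- Pull from the OLTP source without taking shared locks (accepting dirty reads).
    INSERT INTO stg.Customer (CustomerId, CustomerName, ModifiedDate)
    SELECT CustomerId, CustomerName, ModifiedDate
    FROM OltpDb.dbo.Customer WITH (NOLOCK)
    WHERE ModifiedDate >= DATEADD(DAY, -1, GETDATE());  -- e.g. the "last 24 hours" criterion
    -- ...transform and load into the warehouse, then truncate stg.Customer again at the end.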
This has worked well for me, though I admit I haven't done many of these projects, nor any really large ones.
I'm currently working on a small/mid-size data warehouse. We're adopting some of the concepts that Kimball puts forward, i.e. the star schema with fact and dimension tables. We structure it so that facts only join to dimensions (not fact to fact or dimension to dimension - but this is our choice, not saying it's the way it should be done), so we flatten all dimension joins to the fact table.
We use SSIS to move the data from the production DB -> source DB -> staging DB -> reporting DB (we probably could have used fewer DBs, but that's the way it's fallen).
SSIS is really nice as it lets you structure your data flows very logically. We use a combination of SSIS components and stored procs; one nice feature of SSIS is the ability to provide SQL commands as a transform between a source/destination data flow. This means we can call stored procs on every row if we want, which can be useful (albeit a bit slower).
We're also using a new SQL Server 2008 feature called change data capture (CDC), which allows you to audit all changes on a table (you can specify which columns you want to track). We use that on the production DB to tell what has changed, so we can move just those records across to the source DB for processing.
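For anyone curious, enabling CDC is only a couple of calls (the table and column names below are hypothetical):
    -- Enable CDC once per database, then per table (SQL Server 2008+).
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
        @source_schema        = N'dbo',
        @source_name          = N'Orders',
        @role_name            = NULL,
        @captured_column_list = N'OrderId, CustomerId, Status, TotalAmount';  -- only the columns of interest
    -- The ETL can then ask for just the rows that changed in a given LSN range.
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');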
I agree with the highly rated answer but thought I'd add the following:
* Do you use a staging area to perform the transformation and then load into the warehouse?
It depends on the type of transformation whether it will require staging. Staging offers the benefit of breaking the ETL into more manageable chunks, and also provides a working area that allows manipulations to take place on the data without affecting the warehouse. It can help to have (at least) some dimension lookups in a staging area which store the keys from the OLTP system and the key of the latest dim record, to use as a lookup when loading your fact records.
The transformation happens in the ETL process itself, but it may or may not require some staging to help it along the way.
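A rough sketch of that kind of dimension lookup when loading facts (all names hypothetical):
    -- Map each OLTP business key to the current surrogate key of its dimension row.
    SELECT s.OrderId,
           s.OrderDate,
           s.Amount,
           ISNULL(dc.CustomerKey, -1) AS CustomerKey   -- -1 = "unknown" member for late-arriving customers
    INTO   stg.FactOrderLoad
    FROM   stg.OrderSource AS s
    LEFT JOIN dw.DimCustomer AS dc
           ON dc.CustomerBusinessKey = s.CustomerId
          AND dc.IsCurrent = 1;                        -- latest version of the dim record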
* How do you link data between the warehouse and the OLTP database?
It is useful to load the business keys (or actual primary keys if available) into the data warehouse as a reference back to the OLTP system. Also, auditing in the DW process should record the lineage of each bit of data by recording the load process that has loaded it.
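One simple way to record that lineage (hypothetical names) is a load-run table whose key is stamped on every row the run loads:
    -- One row per ETL run; each warehouse row carries the LoadRunId that loaded it.
    CREATE TABLE etl.LoadRun (
        LoadRunId   INT IDENTITY(1,1) PRIMARY KEY,
        PackageName SYSNAME     NOT NULL,
        StartedAt   DATETIME    NOT NULL DEFAULT GETDATE(),
        Status      VARCHAR(20) NOT NULL DEFAULT 'Running'
    );
    -- e.g. dw.FactOrders (..., OrderBusinessKey INT NOT NULL, LoadRunId INT NOT NULL REFERENCES etl.LoadRun)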
* Where/How do you manage the transformation process - in the database as sprocs, dts/ssis packages, or SQL from application code?
This would typically be in SSIS packages, but often it is more performant to transform in the source query. Unfortunately, this makes the source query quite complicated to understand and therefore to maintain, so if performance is not an issue then transforming in the SSIS code is best. If you do transform in the source query, that is another reason for having a staging area, as you can then make more joins in the source query between different tables.
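A rough sketch of what "transforming in the source query" can look like (hypothetical names) - the SSIS data-flow source is a query that already joins and reshapes the staged tables:
    SELECT o.OrderId,
           c.Region,
           CASE o.StatusCode WHEN 'C' THEN 'Complete'
                             WHEN 'X' THEN 'Cancelled'
                             ELSE 'Open' END AS OrderStatus,  -- decode done in the source query
           o.Quantity * o.UnitPrice          AS LineAmount    -- derived column done in the source query
    FROM   stg.Orders    AS o
    JOIN   stg.Customers AS c ON c.CustomerId = o.CustomerId;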
John Saunders' process explanation is a good one.
If you are looking to implement a data warehouse project in SQL Server, you will find all the information you require for delivering the entire project in the excellent text "The Microsoft Data Warehouse Toolkit".
Funnily enough, one of the authors is Ralph Kimball :-)
You may want to take a look at Data Vault Modeling. It claims to solve some longer-term issues like changing attributes.
I am looking at an application designed to import huge amounts of data into a Microsoft SQL Server 2000 database. The application seems to take an awfully long time to complete, and I suspect the application design is flawed. Someone asked me to dig into the application to find and fix serious bottlenecks, if any. I would like a structured approach to this job and have decided to prepare a checklist of potential problems to look for. I have some experience with SQL databases and have so far written down some things to look for.
But some outside inspiration would be very helpful as well. Can any of you point me to some good resources on checklists for good database schema design and good database application design?
I plan on developing checklists for the following main topics:
Database hardware - The first thing is to establish whether the server hardware is appropriate.
Database configuration - The next step is to ensure the database is configured for optimal performance.
Database schema - Does the database schema have a sound design?
Database application - Does the application incorporate sound algorithms?
Good start. Here are the recommended priorities.
First Principle. Import should do little or no processing other than source file reads and SQL Inserts. Other processing must be done prior to the load.
Application Design is #1. Do the apps do as much as possible on the flat files before attempting to load? This is the secret sauce in large data warehouse loads: prepare offline and then bulk load the rows.
Database Schema is #2. Do you have the right tables and the right indexes? A load doesn't require any indexes. Mostly you want to drop the indexes before the load and rebuild them afterwards.
A load had best not require any triggers. All that triggered processing can be done off-line to prepare the file for a load.
A load had best not be done as a stored procedure. You want to be using a simple utility program from Microsoft to bulk load rows.
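A rough sketch of that pattern with SQL Server 2000-era syntax (the file path, table and index names are made up):
    -- Drop nonclustered indexes so the load only writes data pages.
    DROP INDEX ImportTarget.IX_ImportTarget_Customer;
    -- Bulk load the pre-processed flat file in batches with a table lock.
    BULK INSERT dbo.ImportTarget
    FROM 'D:\loads\import_prepared.dat'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK, BATCHSIZE = 100000);
    -- Recreate the indexes once, after all rows are in.
    CREATE INDEX IX_ImportTarget_Customer ON dbo.ImportTarget (CustomerId);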
Configuration. Matters, but much, much less than schema design and application design.
Hardware. Unless you have money to burn, you're not going far here. If -- after everything else -- you can prove that hardware is the bottleneck, then spend money.
Three items I would add to the list:
Bulk insert - Are you importing the data using a bulk provider (e.g. BCP or SqlBulkCopy) or via individual INSERT/UPDATE statements?
Indexes - Do you have indexes on your target tables? Can they be dropped prior to import and then rebuilt after?
Locking - Is there any contention occurring while you are trying to import data?
You have left out the first place I would start looking: the technique used to import the data.
If the application is inserting single rows, is there a reason for that? Is it using DTS, BULK INSERT, or BCP? Is it loading to a staging table or to a table with triggers? Is it loading in batches, or attempting to load and commit the entire set in one go? Is there extensive transformation or data type conversion of rows on their way in? Is there extensive transformation of data into a different schema or model?
I wouldn't worry about 1 and 2 until I saw whether the ETL techniques used were sound, and if they are importing data into an existing schema, then you aren't going to have much room to change anything to do with 3. With regard to the import and 4, I prefer not to run many algorithms on the data during the load portion.
For the best performance in the most general cases, load to a flat staging table with good reliable basic type conversion and exceptions done at that time (using SSIS or DTS). For very large data (multi-million row daily loads, say), I load in 100,000 or 1,000,000 record batches (this is easily settable in BCP or SSIS). Any derived columns are either created at the time of the load (SSIS or DTS) or right after with an UPDATE. Run exceptions, validate the data and create constraints. Then manipulate the data into the final schema as part of one or more transactions - UPDATEs, INSERTs, DELETEs, GROUP BYs for dimensions or entities or whatever.
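A rough sketch of that flow (hypothetical names): derived columns and constraints after the flat staging load, then a transactional move into the final schema:
    -- Derived column created right after the bulk load.
    UPDATE stg.SalesFlat
    SET    NetAmount = GrossAmount - ISNULL(DiscountAmount, 0);
    -- Run exceptions and validate, then constrain the staged data.
    ALTER TABLE stg.SalesFlat ADD CONSTRAINT CK_SalesFlat_Net CHECK (NetAmount >= 0);
    -- Manipulate into the final schema as a transaction.
    BEGIN TRANSACTION;
        INSERT INTO dw.DimCustomer (CustomerBusinessKey, CustomerName)
        SELECT DISTINCT s.CustomerId, s.CustomerName
        FROM   stg.SalesFlat AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dw.DimCustomer AS d
                           WHERE d.CustomerBusinessKey = s.CustomerId);
        INSERT INTO dw.FactSales (CustomerKey, SaleDate, NetAmount)
        SELECT d.CustomerKey, s.SaleDate, s.NetAmount
        FROM   stg.SalesFlat AS s
        JOIN   dw.DimCustomer AS d ON d.CustomerBusinessKey = s.CustomerId;
    COMMIT TRANSACTION;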
Obviously, there are exceptions to this and it depends a lot on the input data and the model. For instance, with EBCDIC packed data in inputs, there's no such thing as good reliable basic type conversion in the load stage, so that causes your load to be more complex and slower as the data has to be processed more significantly.
Using this overall approach, we let SQL do what it is good for, and let client applications (or SSIS) do what they are good for.