How to aggregate data from SQL Server 2005 - sql-server

I have about 150,000 rows of data written to a database every day. These rows represent, for example, outgoing articles. Now I need to show a graph using SSRS that shows the average number of articles per day over time. I also need to show the actual number of articles from yesterday.
The idea is to have an aggregated view of all our transactions and to have something that can indicate that something is wrong (for example, that we sent out 20% fewer articles than the average).
My idea is to move yesterday's data into SSAS every night and to store there both the aggregated number of transactions and the actual number of transactions from yesterday's data. Using SSAS would hopefully speed up the reports.
Do you think this is the right idea? Should I skip SSAS and build the reports straight on the raw data? I know how to use Reporting Services on raw data using standard SQL queries, but how would this change when querying SSAS? I don't know SSAS - where do I start?

The neat thing with SSAS is that you can get those indicators that you talk about quite easily either by creating calculated measures or by using KPIs.
I started with Delivering Business Intelligence with Microsoft SQL Server 2005. It has a good introduction, but unfortunately it's too verbose when it comes to the details. Still, if you want to understand SSAS, OLAP and reporting with this framework, it's a good start.
Mosha Pasumansky has a blog on SSAS and MDX with great links.
Other than that I would recommend Microsoft's Books Online.

Are you sure you aren't mixing up SSAS (Analysis Services) and SSIS (integration services)?
SSAS is not an ETL, it is an OLAP tool.
SSIS is an ETL tool.
I agree with everything that Rowan said. I'm just confused by the terms.

SSAS is an ETL tool. Basically you get data from somewhere (your outgoing articles), do something to it (aggregate), and put it somewhere else (your aggregates table, data warehouse, etc). Check the link for details.
You probably won't be keeping all of the rows in the DB indefinitely, and if you want to be able to report on longer trends you will in any case need to do some kind of aggregation of the historical data. So making the reports use this historical data store as their source makes sense. You can then use it to do all kinds of fancy reporting.
TL;DR: Define your aggregated history table with your future reporting needs in mind. Use SSAS to populate the table and refresh it from the daily updates. Report from that table. Further reading: star schemas and data warehousing.
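As a rough illustration only, an aggregated history table for this scenario could be as simple as the following (the table and column names are hypothetical, not taken from the question; datetime is used because SQL Server 2005 has no date type):

-- Hypothetical daily aggregate table: one row per day and article type
CREATE TABLE dbo.DailyArticleSummary (
    SummaryDate   datetime     NOT NULL,  -- midnight of the day being summarized
    ArticleType   varchar(50)  NOT NULL,
    ArticleCount  int          NOT NULL,
    CONSTRAINT PK_DailyArticleSummary PRIMARY KEY (SummaryDate, ArticleType)
);

Reports on averages and trends then only touch this small table instead of the 150,000-row daily detail.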

#Sergio and #Rowan
Yes, we're not talking about loading and transforming data into the database (like an SSIS tool would do). That's solved using our integration platform.

#Riri, maybe SSAS is overkill for the situation you presented. If you only need to populate summarization tables daily, you can accomplish it by creating a regular job in SQL Server (a SQL Server Agent job) and doing the work in a plain T-SQL script.
I've used this approach for several years in a daily process that calculates business indicators from about 9 GB of new data per day. It works, it's fast, it's simple and it uses a technology you're already used to. If your daily process gets more complicated (it needs to read from files, use FTP, send emails) you can move to an SSIS package (or any other ETL tool you like), but I cannot recommend using SSAS unless you need to provide OLAP capabilities to your users.
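A minimal sketch of what such a nightly T-SQL job step could look like, assuming hypothetical table names (dbo.OutgoingArticles for the detail rows, and a summary table along the lines of the dbo.DailyArticleSummary sketch above):

-- Aggregate yesterday's detail rows into the summary table (hypothetical names)
INSERT INTO dbo.DailyArticleSummary (SummaryDate, ArticleType, ArticleCount)
SELECT
    DATEADD(day, DATEDIFF(day, 0, GETDATE()) - 1, 0) AS SummaryDate,  -- midnight yesterday
    a.ArticleType,
    COUNT(*) AS ArticleCount
FROM dbo.OutgoingArticles AS a
WHERE a.CreatedAt >= DATEADD(day, DATEDIFF(day, 0, GETDATE()) - 1, 0)  -- start of yesterday
  AND a.CreatedAt <  DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)      -- start of today
GROUP BY a.ArticleType;

The "yesterday vs. average" indicator in the report is then a trivial SELECT against the summary table.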

Power BI and SQL Server Indexes

I've done some research without finding valuable information about my question.
I'm working on a data warehouse project, and one of my customer's requirements is to use Power BI Pro for data visualisation.
What is not clear to me is whether Power BI, while acquiring data into its data model, would benefit from the indexing structure developed in SQL Server.
Thank you in advance for recommendations/tips on this subject.
It somewhat depends on whether you are using a live connection.
Existing indexes may speed up data loading when using PowerBI in import mode where the data source is a view, query, or stored procedure.
They will also be used in Live mode when connecting to the above sources, and might be used when connecting directly to multiple tables.
As the comments state, if you are bringing entire tables into PowerBI with import mode, then the existing indexes will not benefit you, and the internal SSAS instance that PBI uses is a whole different kettle of fish.
One caveat is that columnstore indexes can be used to get around some of the data size limitations when dealing with the gateway as described here: https://community.powerbi.com/t5/Power-Query/Using-SQL-Server-with-Nonclustered-Columnstore-Index/td-p/563787, but that's not directly related to your question.
Indexes help with retrieval speed on the server end. How much they will help depends on the specifics of your situation. If you are doing a lot of data transformation and mashup in the Power BI query editor, indexes will only help with the steps that select rows from SQL Server. They won't help with steps where the processing is done on the Power BI end (such as merging with data from an Excel file, adding custom columns, or some forms of substituting values). However, since you mention a data warehouse rather than a simple database, I'm going to assume you're doing hardly any transformation on the Power BI end, relying instead on the server end to do the heavy lifting. In that case, indexes will definitely help speed things up if they're designed strategically.
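As an example of what a "strategic" index might look like when Power BI mostly filters a large fact table on a date range, here is a sketch with made-up table and column names:

-- Hypothetical fact table: support date-range filters issued by Power BI
CREATE NONCLUSTERED INDEX IX_FactSales_OrderDate
ON dbo.FactSales (OrderDate)
INCLUDE (CustomerKey, ProductKey, SalesAmount);  -- cover the columns the report actually pulls

Whether this pays off depends, as above, on how much of the work is pushed down to SQL Server rather than done in the Power BI query editor.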
There are some differences between Import mode and Connect live mode.
Import mode:
Data import can be used against any data source type, and it can combine data from different sources. The current Power BI service limit on published file size is 1 GB.
When using import, the data is stored in the Power BI file/service. Therefore, there is no need to set up permissions on the data source side (a service account for the load is enough) and you can share data publicly or with people outside the organization. On the other hand, all data is stored in Power BI. Full DAX expressions and full Power Query transformations are supported.
Connect live mode:
There are more limitations in place for a live connection. It doesn't work against all data sources (the current list can be seen here), and it cannot combine data from multiple sources.
You are also limited to just the one data source/database you selected; you can't combine data from multiple data sources anymore. If you are connected to a SQL database, you can still create logical relationships between objects from that database, as well as measures and calculated columns. When you are connected to SQL Server Analysis Services, you are limited to the report layout and can't even create calculated columns; currently you can only create measures. When using a live connection, users have to have access to the underlying data source, which means you can't share outside of your organization or publicly. Full DAX expressions are not supported, only report-level measures (to learn more about report-level measures, watch this great video from Patrick), and there are no Power Query transformations.
You can learn more: directquery-live-connection-or-import-data-tough-decision

Most Efficient Way to Migrate Un-Normalized Data in an Access Database to a Normalized Form in a SQL Server Database

I've been doing some research on this topic for a while now and can't seem to find a similar instance to my issue. I will try and explain everything as best I can, as simply as I can.
The problem is in the title: I am trying to migrate data from an Access database to SQL Server. Typically this isn't really a hard problem, as there exist several import/export tools within SQL Server, but I am looking for the best solution, or at least some advice/tips, as I am somewhat new to database migration. I will now explain my situation.
So I am currently working on migrating data that exists in an Access “database” (database in quotes because I don’t think it is actually a database, you’ll know why in a minute) in an un-normalized form. What I mean by un-normalized is that all of the data is in one table. This table has about 150+ columns and the rows number in the thousands. Yikes, I know; this is what I’ve walked into lol. Anyways, sitting down and sorting through everything, I’ve designed relationships for the data that normalize it nicely in its new home, SQL Server. Enter my predicament (or at least part of it). I have the normalized database set up to hold the data but I’m not sure how to import it, massage/cut it up, and place it in the respective tables I’ve set up.
Thus far I've done a bunch of research into what can be done, and for starters I found out about the SQL Server Migration Assistant. I've begun messing with it and was able to import the data from Access into SQL Server, but not in the way I wanted. All I got was a straight copy and paste of the data into my SQL Server database, exactly as it was in the Access database. I then learned about the typical practice of setting up a global table/staging area for this type of migration, but I am somewhat of a novice when it comes to using T-SQL. The heart of my question comes down to this: Is there some feature in SQL Server (either its import/export tool or the SSMA) that will allow me to send the data to the right tables that already exist in my normalized SQL Server database? Or do I import to the staging area and write the script(s) to dissect and extract the data into the respective normalized tables? If it is the latter, can someone please show me some tips/examples of what the T-SQL would look like to do this sort of thing? Obviously I can't expect exact scripts from anyone without sharing the data (which I don't have the liberty to do as it is customer data), so some cookie-cutter examples will work.
Additionally, future data is going to come into the new database from various sources (like maybe excel for example) so that is something to keep in mind. I would hate to create a new issue where every time someone wants to add data to the database, a new import, sort, and store script has to be written.
Hopefully this hasn't been too convoluted and someone will be willing (and able) to help me out. I would greatly appreciate any advice/tips. I believe this would help other people besides me, because I found a lot of other people searching for similar things. Additionally, it may lead to T-SQL experts showing examples of such data migration scripts and/or explaining how to use the existing tools in ways others haven't used before, or covering functions/capabilities not adequately explained in the documentation.
Thank you,
L
First this:
Additionally, future data is going to come into the new database from various sources (like maybe excel for example)...?
That's what SSIS is for. Setting up SSIS is not a trivial task, but it's not rocket science either. SQL Server Management Studio has an Import/Export Wizard, which is an easy-to-use SSIS package creator. That will get you started. There are many alternatives, such as PowerShell, but SSIS is the quickest and easiest solution IMO, especially when dealing with data from multiple sources.
SSIS works nicely with Microsoft products as data sources (such as Excel and SharePoint).
For some things, too, you can create an MS Access front end that interfaces with SQL Server via SQL Server stored procedures. It just depends on the target audience. This is easy to set up; a quick Google search will return many simple examples. It's actually how I learned SQL Server 20+ years ago.
Is there some feature in SQL Server that will allow me to send the data to the right tables that already exist in my normalized SQL Server database?
Yes and don't. For what you're describing it will be frustrating.
Or do I import to the staging area and write the script(s) to dissect and extract the data to the respective normalized table?
This.
If it is the latter, can someone please show me some tips/examples of what the T-SQL would look like to do this sort of thing?
When dealing with denormalized data, a good splitter is important. Here are my two favorites:
DelimitedSplit8K
PatternSplitCM
In SQL Server 2016 you also have STRING_SPLIT, which is faster (but has issues).
Another must-have is a good NGrams function. The link I posted has the function attached at the bottom of the article. I have some string-cleaning functions here.
The links I posted have some good examples.
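To show the kind of shredding a splitter makes possible, here is a sketch that breaks a semicolon-delimited column out of the staging table into one row per value using SQL Server 2016's STRING_SPLIT (the SourceRowID and Tags columns are invented for illustration; on older versions, DelimitedSplit8K plays the same role):

-- Shred a delimited column into one row per value
SELECT s.SourceRowID,
       LTRIM(RTRIM(t.value)) AS Tag
FROM dbo.SourceStagingTable AS s
CROSS APPLY STRING_SPLIT(s.Tags, ';') AS t
WHERE LTRIM(RTRIM(t.value)) <> '';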
I agree with all the approaches mentioned: Load the data into one staging table (possibly using SSIS) then shred it with T-SQL (probably wrapped up in stored procedures).
This is a custom piece of work that needs hand-built scripts. There's no automated tool for this, because both your source and target schemas are custom schemas, so you'd need to define all that mapping and those rules somehow... and no, SSIS does not magically do this!
It sounds like you already have a target schema and mappings between the source and target schema worked out.
As an example your first step is to load 'lookup' tables with this kind of query:
INSERT INTO TargetLookupTable1 (Field1,Field2,Field3)
SELECT DISTINCT Field1,Field2,Field3
FROM SourceStagingTable
TargetLookupTable1 should already have an identity primary key defined (it is not mentioned in the above query because it is auto-generated).
This is where you will find your first problem. You'll almost certainly find that your DISTINCT query just gives you a whole lot of duplicated, misspelt, rubbish data. So before you even load your lookup table you need to do data cleansing.
I suggest you clean the data in your source system directly, but it depends how comfortable you are with that.
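What the cleansing looks like depends entirely on your data; as a hedged sketch only, the simplest staging-table cleanup might be along these lines:

-- Basic cleanup before loading lookup tables: trim whitespace, standardize case
UPDATE dbo.SourceStagingTable
SET Field1 = UPPER(LTRIM(RTRIM(Field1))),
    Field2 = LTRIM(RTRIM(Field2)),
    Field3 = LTRIM(RTRIM(Field3));

-- Then eyeball the remaining near-duplicates before running the DISTINCT insert
SELECT Field1, COUNT(*) AS Cnt
FROM dbo.SourceStagingTable
GROUP BY Field1
ORDER BY Field1;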
The next step, assuming your data is all clean and you've loaded a dozen lookup tables in this way:
Now you need to load the transactions, but you don't know the lookup keys that you just generated!
The trick is to pre-include an empty column for this in your staging table.
Once you've loaded your lookup table, you can write the key back into the staging table. This query matches back on the fields you used to load the lookup and writes the key back into the staging table:
UPDATE TGT
SET MyNewLookupKey = NewLookupTable.MyKey
FROM SourceStagingTable TGT
INNER JOIN
NewLookupTable
ON TGT.Field1 = NewLookupTable.Field1
AND TGT.Field2 = NewLookupTable.Field2
AND TGT.Field3 = NewLookupTable.Field3
Now you have a column called MyNewLookupKey in your staging table which holds the correct lookup key to load into your transaction table.
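The final load is then a straightforward insert from staging; a sketch, assuming a hypothetical target transaction table (only SourceStagingTable and MyNewLookupKey come from the example above):

-- Load the transaction table using the lookup key written back into staging
INSERT INTO dbo.TargetTransactionTable (LookupKey, TransactionDate, Amount)
SELECT s.MyNewLookupKey,
       s.TransactionDate,
       s.Amount
FROM dbo.SourceStagingTable AS s
WHERE s.MyNewLookupKey IS NOT NULL;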
Ongoing uploads of data are a separate issue, but you might want to investigate an MS Access Data Project (although they are apparently being phased out, they are very handy as a front end into SQL Server).
The thing to remember is: if there is anything ambiguous about your data, for example "these rows say my car is black but these rows say my car is white", then you (a human) need to come up with a rule for disambiguating it. It can't be done automatically.
So there are quite a number of ways to skin this cat. I don't know much about the "Migration Assistant", but I somehow doubt it's going to make your life easier given what you're trying to do.
I'd just dump the whole denormalized mess into a single big staging table then shred it where you need it using SQL. I know you asked for help with the TSQL, but without having some idea of what the denormalized data is and how you want to re-shape it, all I can do really is suggest you read up on SQL in general (select, from, where, group by, etc).
You could also do the work in SSIS, but ultimately the solution you use is largely going to depend on the nature of how you need to normalize the big denormalized data set. IMHO doing this in SQL is usually the easiest way, but then again when you're a hammer, everything looks like a nail.
As far as future-proofing the process, how you import the Access data probably has little bearing on how you'd import Excel data. If you have a whole lot of different data sources which you'll need to incorporate on a recurring basis, SSIS might be a good choice to invest some time and effort into for the long run. No matter what, incorporating data from a distinct data source takes time and effort; you'll have to do some extra work regardless. I would weigh how frequently you think you'll have to integrate a given data source against how much effort is involved to massage it into the format you want.
I have a completely different opinion, because I do both database development and Microsoft's Power BI. On the PBI side we come across a lot of non-normalized data, because a lot of the data is coming in from Excel.
My guess is that what is now in Access was an import of something that originally began in Excel.
Excel Power Query and PBI offer transforms to pivot and unpivot layouts. I would use those tools for that task, then import the results into SQL.
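If you would rather keep the reshaping on the SQL Server side after a straight import, T-SQL's UNPIVOT does a similar job to the Power Query unpivot; a minimal sketch with invented column names (an alternative to, not part of, the approach above):

-- Hypothetical denormalized layout: one row per customer, one column per month
SELECT u.CustomerID,
       u.MonthName,
       u.SalesAmount
FROM dbo.SourceStagingTable
UNPIVOT (
    SalesAmount FOR MonthName IN (Jan, Feb, Mar)  -- list every repeated column; they must share a data type
) AS u;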

Do I need a cube?

We have a content ingestion system which receives (mobile) digital contents of different types (Music, Ringtone, Video, Game, Wallpaper etc) from various providers (Sony, Universal Music, EA Games etc) and then dispatches them across several online stores (e.g. Store1, Store2 etc).
The managers want to know how many items of each content type, in a given time window, have come through from each supplier, and which store they have gone to.
To me it seems like a report that needs an OLAP cube. Am I correct? The problem is that I am a .NET developer and not very skilled in BI and SQL Server Analysis Services, so I want to make this simple yet flexible and meaningful. Is there an easier way of having a reporting cube and a data mart to produce reports like this? (I am not sure if we can purchase SSAS and SSIS licenses at all.)
And for such a data mart and cube, what structure is suggested?
From your description, a cube isn't necessary. Assuming this data is in a database, you can just write a query to get that result. If you've bought a licence of SQL Server (i.e. not the free edition) then you already have SSAS, SSIS and SSRS.
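For instance, a plain aggregate query along these lines would answer the managers' question directly (the table and column names are invented for illustration):

-- Items ingested per content type, provider and store in a given time window
SELECT c.ContentType,
       c.Provider,
       c.Store,
       COUNT(*) AS ItemCount
FROM dbo.ContentDispatches AS c
WHERE c.DispatchDate >= '20240101'
  AND c.DispatchDate <  '20240201'
GROUP BY c.ContentType, c.Provider, c.Store
ORDER BY c.ContentType, c.Provider, c.Store;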
Some of a cube's main advantages are:
It's easier for end users to do ad-hoc reporting
Performance is often better than a relational (SQL query) source
Some disadvantages are:
You need to spend processing time 'building' the cube
The query language (MDX) can be a challenge to learn
You don't have an ad-hoc user analysis requirement here
An SSAS cube presented in Excel Pivot Tables is probably still the most powerful and flexible end-user query tool out there, with a very low learning curve (most managers/analysts can already use Excel). Once they have a cube they can satisfy many requirements themselves, without you needing to constantly tweak queries. Even when they do want something more complex, you have a perfect source for report/query design and testing.
But designing and building an SSAS cube is very difficult, and cubes are quite obscure to debug.
I suggest starting with Power Pivot - it's a free Excel Add-In that builds an in-memory cube, and presents the results as Excel Pivot Tables. It scales well through advanced compression and the resulting Model can be published to an SSAS Tabular server. The calculation language is DAX which is an improvement on the horrible MDX - DAX reads more like Excel functions.
This site is probably the best starting point for Power Pivot:
http://www.powerpivotpro.com/
You can solve this with just standard queries or views in SQL Server. Tools such as PowerPivot for Excel also allow you to create local cubes with very little effort.
Of course, purchasing an SSAS license and moving to a cube environment has several advantages, despite the extra cost:
Cubes are faster and allow for more complex calculations than SQL queries
With the introduction of the SSAS Tabular Model, making cubes really isn't hard anymore
Creating cubes often forces you to clean up your data model, which has a positive effect on your architecture overall in most cases
Creating a cube might be overkill for your scenario, as your data is not that complicated and not that big. But Excel might not be enough, as it is hard to pivot data in your database directly.
You can try embedding WebPivotTable into your website or your application. It provides all the functions of an Excel pivot table and can connect to CSV/Excel files or to a database through a web service interface. It is web-based and the front-end user interface is quite intuitive, so users can easily get what they want with simple drag and drop. Here are the demo and documents.
Of course, if you still want to create a cube, this tool can also be very helpful as it can connect to SSAS cubes directly.

Practical Implementation for Data Warehouse

Data warehousing seems to be a big trend these days, and is very interesting to me. I'm trying to acquaint myself with its concepts, and am having a problem "seeing the forest through the trees" because all of the data warehouse models and descriptions I can find online are theoretical and don't give examples with actual technologies being used. I'm a contextual learner, so abstract, theoretical explanations don't really help me out all that much.
Now there seem to be many "data warehousing models", but all of them seem to have some similar characteristics. There is usually an "ODS" (operational data store) that aggregates data from multiple sources into the same place. A process known as "ETL" then converts the data in this ODS into a "data vault", and again into "data marts" and/or "strategy marts".
Can someone provide an example of the technologies that would be used for each of these components (ODS, ETL, data vault, data/strategy marts)?
It sounds like the ODS could just be any ordinary database, but the data vault seems to have some special things going on because it is used by these "marts" to pull data from.
ETL is the biggest thing I'm choking on by far. Is this a language? A framework? An algorithm?
I think once I see a concrete example of what's going on at each step of the way, I'll finally get it. Thanks in advance!
ETL is a process. The abbreviation stands for Extract-Transform-Load, which describes what is done with the data during the process. The process can be implemented anywhere you need to create a bridge between two systems with different data formats. First, you pull (extract) data from a source system (database, flat files, web service etc.). Then the data is processed (transform) to comply with the format of the target storage (again this can vary: databases, files, API calls). During the transform step, further actions can be performed on the data set, such as enriching it with data from other sources, cleansing it and improving its quality. The last step is loading the transformed data into the target storage.
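To make the three steps concrete, here is a deliberately simplified T-SQL sketch of one extract-transform-load run between hypothetical databases (all names are invented; real ETL tools wrap this pattern in reusable, configurable packages):

-- Extract: pull recently changed rows from the source system into a staging table
INSERT INTO Staging.dbo.CustomerStage (CustomerID, FullName, Email, LoadedAt)
SELECT c.CustomerID, c.FullName, c.Email, GETDATE()
FROM SourceSystem.dbo.Customer AS c
WHERE c.ModifiedDate >= DATEADD(day, -1, GETDATE());

-- Transform: cleanse and standardize in the staging area
UPDATE Staging.dbo.CustomerStage
SET Email = LOWER(LTRIM(RTRIM(Email)));

-- Load: add the cleansed rows that the warehouse does not have yet
INSERT INTO Warehouse.dbo.DimCustomer (CustomerID, FullName, Email)
SELECT s.CustomerID, s.FullName, s.Email
FROM Staging.dbo.CustomerStage AS s
WHERE NOT EXISTS (SELECT 1
                  FROM Warehouse.dbo.DimCustomer AS d
                  WHERE d.CustomerID = s.CustomerID);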
Typically, an ETL process is employed for loading a data warehouse, for migrating data from one system or database to another when moving from a legacy system to a new one, or for synchronizing data between two or more systems. It is also used as an intermediate layer in broader MDM and BI solutions.
In terms of specific software, there are many ETL tools on the market, ranging from robust solutions from big players such as Informatica, IBM DataStage and Oracle Data Integrator, to more affordable and open-source providers such as CloverETL, Talend or Pentaho. Most of these tools offer a GUI in which the flow and processing of data is defined through diagrams.
For Microsoft SQL Server 2005 and later, the ETL tool is called SSIS (SQL Server Integration Services). If you install at least the Standard edition of SQL Server, you get the Business Intelligence Development Studio, with which you can design your data flows. Basically, what an ETL tool does is take data from one or more sources (tables, flat files, ...), transform it (add columns, join, filter, map to different data types, etc.) and finally store it again in one or more tables or files.
To get a basic understanding of how something works you can watch e.g. this video or this one (both from midnightdba). They're a bit lengthy, but you get an idea. They certainly helped me in understanding the basic functionality of an ETL tool.
Unfortunately I have not yet dug into other platforms or tools.
I'd highly recommend checking out some of the books by Ralph Kimball and Margy Ross (The Data Warehouse Toolkit, The Data Warehouse Lifecycle Toolkit) for an introduction to data warehousing.
My company's data warehouse is built using the Oracle Warehouse Builder tool for ETL. The OWB is a GUI tool that generates PL/SQL code on the database to manipulate the data. After manipulation and cleansing, the data is published to an Oracle datamart. The datamart is a database instance that users access for ad-hoc querying via Oracle Discoverer (Java software).

What is the best way to import standalone data into a database?

A little background:
I have a remote, standalone SQL Server database that is truncated at the end of every weekend. The data is hardly relational, not normalized at all, and pretty annoying to work with. On top of that, the schema for this database cannot be modified at all, because it is recreated by a third-party application. Before the database is destroyed each week, a backup is created of that week's data. On average each database will have between 500,000 and 2,000,000 records.
My task is to create a historical version of this database that is a superset of all of these database backups. It should tie into our other databases, which contain related sets of information. I have already started on an application to perform this task, and I've gotten to the point where I'm able to match data with our other databases, but I'm wondering if there's any best practice for handling this kind of import.
How do I make sure that I have unique IDs in my historical version of this database? Are there any features in SQL Server that can do some of the heavy lifting in this for me?
Thanks for your time on this.
There's definitely a feature in SQL Server that can assist you, and that feature is called SSIS (SQL Server Integration Services). One of the main uses of SSIS is ETL (Extract, Transform, Load), which means extracting data from several diverse sources, transforming it into whatever you need to get it into your destination database (such as a data warehouse; any linking with existing data will also happen here), and finally loading it into your destination DB.
I think the best way to get started, if that's what you want of course, is to pick up a good book on SSIS and go through it. While reading, don't forget to play around with the BIDS (Business Intelligence Development Studio - one of the SQL Server tools) to create some test packages.
Furthermore, on the internet you'll find plenty of "getting started" articles.
For your case in particular what I would do is:
create a generic package that can import the data from a source DB (one of your weekly DBs) and insert it into the destination DB - this package can be parameterized using Parent Package Configuration.
create a main package that loops over all backups in a certain folder, restores them one by one and calls the generic import package for each restore; after each successful import, the Control Flow would delete the previously restored DB (a bare T-SQL sketch of the restore/drop step follows below).
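For reference, the restore/drop step inside that loop boils down to T-SQL roughly like the following (the database, logical file names and paths are placeholders; in SSIS you would typically run this from an Execute SQL Task, with the backup path supplied by the loop variable):

-- Restore one weekly backup over a working database (placeholder names and paths)
RESTORE DATABASE WeeklyImport
FROM DISK = N'C:\Backups\Week_2019_01_07.bak'
WITH MOVE N'WeeklyDb_Data' TO N'C:\Data\WeeklyImport.mdf',
     MOVE N'WeeklyDb_Log'  TO N'C:\Data\WeeklyImport.ldf',
     REPLACE;

-- ... run the generic import package against WeeklyImport here ...

-- Drop the working copy once the import has succeeded
DROP DATABASE WeeklyImport;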
I think I've given you enough material to investigate on now :-)
Good luck,
Valentino.
