Querying large scale table in SQL Server

I have a table with 200 million records. New records are added to it every minute. I want to run GROUP BY and SUM queries against it for KPI analysis. What is the best way to query the table without hurting performance? Currently, I save the result in a separate table and keep that table up to date with a SQL Server trigger, but this isn't a good approach. Is there any other way you can suggest?

If you use SQL Server 2016 or later, you can use the Real-Time Operational Analytics approach to address this type of issue. Real-Time Operational Analytics lets you run analytics and OLTP workloads on the same database, so you can avoid the ETL process.

Using a separate table is a good solution if the events are stored in that second table. You can store events by month, week, day, etc. and calculate the analysis from them.
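As a rough illustration of the per-period summary table idea, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for SQL Server, and the table names (events, daily_summary), the columns, and the refresh_day helper are all hypothetical. It recomputes one day at a time on a schedule rather than firing a trigger on every insert.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_time TEXT, kpi TEXT, amount REAL);
    CREATE TABLE daily_summary (
        day TEXT, kpi TEXT, total REAL, row_count INTEGER,
        PRIMARY KEY (day, kpi)
    );
""")

conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01 08:00:00", "sales", 10.0),
        ("2024-01-01 09:30:00", "sales", 5.0),
        ("2024-01-01 10:00:00", "returns", 2.0),
        ("2024-01-02 11:00:00", "sales", 7.0),
    ],
)


def refresh_day(conn, day):
    """Recompute the summary rows for one day from the detail table."""
    with conn:  # one transaction: drop the old rows, insert the fresh aggregate
        conn.execute("DELETE FROM daily_summary WHERE day = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_summary (day, kpi, total, row_count)
            SELECT date(event_time), kpi, SUM(amount), COUNT(*)
            FROM events
            WHERE date(event_time) = ?
            GROUP BY date(event_time), kpi
            """,
            (day,),
        )


refresh_day(conn, "2024-01-01")
refresh_day(conn, "2024-01-02")

# KPI queries read the small summary table instead of the large detail table.
summary = conn.execute(
    "SELECT day, kpi, total FROM daily_summary ORDER BY day, kpi"
).fetchall()
print(summary)
```

Because each refresh deletes and re-inserts a whole period, running it twice for the same day gives the same result, which makes a scheduled job safe to retry.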

Related

Can I use Hadoop to speed up a slow SQL stored procedure?

The problem:
I have 2 SQL Server databases from 2 different applications. They describe different aspects of industrial machines: one is about "how many consumables were spent per order", the other is about "how many good/bad production items were produced per operator". Sometimes many operators are working on 1 order one after another, sometimes one operator is working on multiple small orders, and there is no connection Order-Operator in the database.
I want to have a unified fact table, where for every timestamp I know MachineID, OrderID and OperatorID. If a timestamp exists in DB1, then the record will have numeric measures from it (consumables); if it exists in DB2, then it will have numeric measures from DB2 (good/bad production items). If it exists in both databases, then it will have all numeric measures. A simple UNION ALL is not enough, because I want to have MachineID, OrderID and OperatorID for every record.
I created a T-SQL stored procedure that performs a FULL JOIN on timestamp and MachineID. But on large data sets (multiple machines, multiple customers) it becomes very slow. Both applications support editing history, so I need to merge the full history from both databases at every nightly load.
To speed up the process, I would like to put calculations into multiple parallel threads, separated by Customer, MachineID, and Year.
I tried to do it by using SQL Server stored procedures, running in parallel by SQL Agent with different parameters, but I found that it didn't help the performance. Instead it created multiple deadlocks when updating staging and final tables.
I am looking for an alternative way to resolve this problem, but I don't know what the right tool is. Can Hadoop or a similar parallel processing tool help with this task?
I am looking for a solution with minimal cost, because it is needed for just one specific task. For everything else, SQL Server and PowerBI reporting are working just fine for me.

Hadoop seems hard to justify in this use case, given the limited scope. Hadoop scales well not only because of parallel processing but also because of parallel IO, when data is distributed across multiple servers/storage media. Unless you are happy to copy all the data to HDFS distributed among multiple nodes, it likely will not help much. If you spin up a Hadoop cluster and run multiple jobs querying a single SQL Server, it will likely end badly for the latter.
Have you considered optimizations that would let you limit the amount of data you process nightly?
E.g. what is 'timestamp' field? Does it reflect last update time? Can you use it to filter rows which haven't been updated since the previous run?
Even if the 'timestamp' is not the time of the last update, can you add an "updateTime" field, with triggers on updates that populate it, so you don't need to import rows which have not changed since the previous run? If you build an index on the field and the number of updates during the day is small relative to the total table size, a query with a filter on that field will hit the index, and fetching incremental changes should be fast.
Another thing to consider - are those DBs running on the same node/SQL server? Access to remote DBs is slow, so if that's the case, think about how to fix this first.
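The incremental-load idea above (an indexed "updateTime" column plus a filter on the previous run time) can be sketched as follows. This is only an illustration in Python, with in-memory SQLite standing in for SQL Server; the measurements table, its columns, and the fetch_changes helper are hypothetical names, and the update_time values are set by hand here rather than by a trigger.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements (
        id INTEGER PRIMARY KEY,
        machine_id INTEGER,
        value REAL,
        update_time TEXT  -- in the real system a trigger would maintain this
    );
    CREATE INDEX ix_measurements_update_time ON measurements (update_time);
""")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        (1, 10, 1.5, "2024-03-01 02:00:00"),
        (2, 10, 2.5, "2024-03-02 14:00:00"),
        (3, 11, 3.5, "2024-03-03 09:00:00"),
    ],
)


def fetch_changes(conn, last_run):
    """Return rows modified after the previous load, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, machine_id, value, update_time FROM measurements "
        "WHERE update_time > ? ORDER BY update_time",
        (last_run,),
    ).fetchall()
    # Advance the watermark to the newest change seen; keep it if nothing changed.
    new_watermark = rows[-1][3] if rows else last_run
    return rows, new_watermark


changed, watermark = fetch_changes(conn, "2024-03-02 00:00:00")
print([row[0] for row in changed], watermark)
```

The nightly job would persist the returned watermark and pass it to the next run, so only rows touched since then are merged again.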

handling performance issues in SSIS

I have billions of records to transfer via SSMS and am looking to improve the migration speed. I am trying to save the result set into a table. I expect my front end to be slow if all that data is in one table, so I am looking at various options. I am thinking of having multiple physical tables. I only need the last 5 years of data. Will it be faster if I execute 5 different versions of the same stored procedure with different year filters and populate the tables? I know one can achieve parallelism in SSIS. My only fear is that, since all five stored procedures run in parallel, they will lock the tables.

You could look into table partitioning schemes in SQL Server. It seems like the Year column would be a good field to use in your partition function.
Table Partitioning in SQL Server
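To make the idea of a partition function on the year more concrete, here is a small Python sketch of how boundary values map a date to a partition number. It mimics RANGE RIGHT behavior, where a boundary value belongs to the partition on its right; the boundary dates and sample rows are made up for illustration, and this is not SQL Server code.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# Hypothetical boundary values: one boundary at the start of each year.
# Four boundaries give five partitions.
BOUNDARIES = [date(year, 1, 1) for year in (2021, 2022, 2023, 2024)]


def partition_number(d):
    """1-based partition for a date; a boundary value goes to the partition on its right."""
    return bisect_right(BOUNDARIES, d) + 1


rows = [
    (date(2020, 6, 15), "a"),   # before the first boundary
    (date(2021, 1, 1), "b"),    # exactly on a boundary
    (date(2022, 12, 31), "c"),
    (date(2024, 3, 9), "d"),    # after the last boundary
]

# Group rows the way a partitioned table would place them.
partitions = defaultdict(list)
for d, payload in rows:
    partitions[partition_number(d)].append(payload)

print(dict(partitions))
```

With one partition per year, each yearly load touches a different partition of a single table, so the front end can keep querying one table instead of five.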

Need Suggestions: Utilizing columnar database

I am working on a high-performance dashboard project where results are mostly aggregated, mixed with non-aggregated data. The first page is loaded by 8 different complex queries that fetch mixed data. The dashboard is served by a centralized database (Oracle 11g) which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, counts, GROUP BY and many WHERE conditions).
The issue is that as the data grows, the DB queries take longer than the defined/agreed time. I am thinking of moving the aggregation (all the counts) to a columnar database, say HBase, while the remaining row-level data is still fetched from Oracle. The two result sets would be merged on a key in the app layer. I need expert opinions on whether this is the correct approach.
There are a few things which are not clear to me:
1. Will Sqoop be able to load data based on a query/view, or only from tables? On a continuous basis or one time only?
2. If a record is modified (e.g. its status is changed), how will HBase get to know?

My two cents. HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near real-time queries, you can go with an MPP database. Commercial options include Vertica or maybe Redshift from Amazon. For an open-source solution, you can use Infobright.
These columnar options will give you great aggregate query performance.

Index/Statistics on volatile tables

One of my applications has the following use case:
the user inputs some filters and conditions about orders (delivery date ranges, ...) to analyze
the application computes a lot of data and saves it to several support tables (potentially thousands of records for each analysis)
the application starts a report engine that uses data from these tables
when exiting, the application deletes the computed records from the support tables
I'm currently analyzing how to enhance query performance by adding indexes/statistics to the support tables, and the SQL Profiler suggests that I create 3-4 indexes and 20-25 statistics.
The records in the support tables are constantly created and removed: is it correct to create all these indexes/statistics, or is there a risk that they will quickly become outdated (with the only result being a constant overhead for maintaining them)?
DB server: SQL Server 2005+
App language: C# .NET
Thanks in advance for any hints/suggestions!

First, this seems like a good situation for a data cube. Second, yes, you should update stats before running your query once the support tables are populated. You should disable your indexes when inserting the data. Then the rebuild command will bring your indexes and stats up to date in one go. Profiler these days is usually quite good at these suggestions, but test the combinations to see what actually gives the best performance gains. For open-source cubes, see: What are the open source tools and techniques to build a complete data warehouse platform?
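The load pattern described above (no index maintenance during the bulk insert, then one rebuild and a statistics refresh) can be sketched like this. The example is Python with in-memory SQLite as a stand-in: SQLite has no ALTER INDEX ... DISABLE / REBUILD as SQL Server does, so the sketch drops and recreates the index and runs ANALYZE instead. The support table, index name, and load_analysis helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; table and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE support (analysis_id INTEGER, order_id INTEGER, amount REAL)"
)


def load_analysis(conn, analysis_id, rows):
    """Bulk-load one analysis without index maintenance, then rebuild index and stats."""
    # Stand-in for disabling the index before the insert.
    conn.execute("DROP INDEX IF EXISTS ix_support_analysis")
    with conn:  # a single transaction for the whole bulk insert
        conn.executemany(
            "INSERT INTO support VALUES (?, ?, ?)",
            [(analysis_id, order_id, amount) for order_id, amount in rows],
        )
    # Stand-in for the rebuild: build the index once over the loaded data.
    conn.execute(
        "CREATE INDEX ix_support_analysis ON support (analysis_id, order_id)"
    )
    # Refresh optimizer statistics after the load.
    conn.execute("ANALYZE")


load_analysis(conn, 1, [(i, i * 1.5) for i in range(1000)])

row_count = conn.execute(
    "SELECT COUNT(*) FROM support WHERE analysis_id = 1"
).fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list('support')")]
print(row_count, index_names)
```

Whether this beats keeping the indexes enabled depends on how many rows each analysis writes, so it is worth timing both variants on realistic data.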

SQL Server replication for 70 databases with transformation in a small time window

We have 70+ SQL Server 2008 databases that need to be copied from an OLTP environment to a separate reporting server. Once the DBs are copied, we will do some partial data transformation: de-normalization, row-level security, etc.
SSRS Reports will be written based on these static denormalized tables and views.
We have a small nightly window for copying and transforming all 70 databases (3 hours).
Currently databases average about 10GB.
Options:
1. Transactional replication:
We would need to create 100+ static denormalized tables on each reporting database.
Doing this for all 70 databases almost reaches our nightly time limit.
As the databases grow we will exceed the time limit. We thought of mixing denormalized tables with views to speed up transformation. But then there would be some dynamic and some static data which is not a solution we can use.
Also with 70 databases using transactional replication we are concerned about bandwidth usage.
2. Snapshot replication:
Copy the entire database each night.
This means we could have a mixture of denormalized tables and views so the data transformation process is quicker.
But the snapshot is a full data copy, so as the DB grows, we will exceed our time limit for completing copy and transformation.
3. Log shipping:
In our nightly window, we could use the log shipping to update the reporting databases, then truncate and repopulate the denormalized tables and use some views.
However, I understand that with log shipping, extra tables and views cannot be added to the subscribing database.
4. Mirroring:
Mirroring is being deprecated, and in addition the mirror DB is not available for reporting until failover.
5. SQL Server 2012 AlwaysOn.
We don't have SQL Server 2012 yet, can this be configured to do an update once a day instead of realtime?
And can extra tables and views be created on the subscribing database (our reporting databases)?
6. Merge replication:
This is meant to be for combining multiple data sources into one database.
But it looks like it allows for a scheduled update (once per day) and only updates the subscriber DB with the latest changes rather than doing an entire snapshot.
It requires adding a rowversion column to every table, but we could handle this. Also, with this solution, could additional tables be created on the subscriber database without the update getting out of sync?
The final option is that we use SSIS to select only the data we need from the OLTP databases. I think this option creates more risk, as we would have to handle inserts/updates/deletes to our denormalized tables, rather than just dropping and recreating the denormalized tables daily.
Any help on our options would be greatly appreciated.
If I've made any incorrect assumptions, please say.

If it were me, I'd go with transactional replication that runs continuously and have views (possibly indexed) at the subscriber. This has the advantage of not having to wait for the data to come over since it's always coming over.
