Microsoft SQL Server table with > 10 million records - sql-server

I have a master table and a corresponding configuration table. Each master record can have more than 100,000 configuration records, and the master table itself can have more than 200 records. Which of the following approaches is best?
Having separate configuration for each master record
Having single configuration table for all master records with proper indexing and partitioning

You should have a single configuration table for all the masters; creating an individual table per master's configuration is a bad idea.
If you have individual tables for each configuration, you will end up with issues like:
Low maintainability.
You would likely need dynamic SQL to fetch the data, which is hard to maintain and debug.
To get data for multiple configurations you would have to UNION the tables, which hurts performance.
Any new configuration in the system would lead to code changes.
Fetching data from roughly 200 × 100,000 = 20 million rows is fine if the table is indexed properly.
For better performance, you can also partition the configuration table on MasterId.
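A rough sketch of what that single, partitioned configuration table could look like (the partition boundaries and the ConfigKey/ConfigValue columns are illustrative assumptions, not from the question):

-- Partition function/scheme: boundaries here are purely illustrative.
CREATE PARTITION FUNCTION pfMaster (int)
    AS RANGE RIGHT FOR VALUES (50, 100, 150, 200);

CREATE PARTITION SCHEME psMaster
    AS PARTITION pfMaster ALL TO ([PRIMARY]);

-- Single configuration table for all masters, clustered and partitioned on MasterId.
CREATE TABLE dbo.Configuration (
    MasterId    int          NOT NULL,
    ConfigKey   varchar(100) NOT NULL,
    ConfigValue varchar(400) NULL,
    CONSTRAINT PK_Configuration PRIMARY KEY CLUSTERED (MasterId, ConfigKey)
) ON psMaster (MasterId);

Queries that filter on MasterId then touch only one partition and seek on the clustered index.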

Related

Tuning large queries in SQL Server

I am new to SQL Server, though I have spent considerable time with Oracle databases. The application I currently manage contains a lot of denormalized staging tables that receive upstream data.
Views have been created on the staging tables, each joining about 40 tables. These views load datamart tables of the same name in another database.
The views take a long time to load the datamart tables, approximately 5 hours. The logic is truncate-and-load, i.e. each day the whole database is truncated and data is loaded from source system files into the staging tables using format files.
How can I tune these views to make the load process faster, given that the truncate-and-load approach is intentional?
You probably need to look into the normal stuff:
Turn on STATISTICS IO to see which table causes most of the I/O in the query (see the snippet after this list)
Check the leftmost node in the actual plan to verify that plan compilation did not end in a timeout (likely, given all the joins)
Look for fat arrows (= a lot of rows being handled) in the plan
Check for any expensive operations (sorts, spools, key lookups with big row counts) in the plan
Check the plan for orders-of-magnitude differences between estimated and actual row counts
Don't pay too much attention to the cost percentages in the actual plan; those are just estimates and can be extremely misleading.
Without more details (CREATE TABLE and index definitions, the actual query plan, etc.) it's difficult to give more specific advice.
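For the first point, a minimal sketch of turning the statistics output on around one of the slow statements (the view name is a placeholder):

-- Report per-table reads and CPU/elapsed time for each statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*) FROM dbo.vw_DatamartLoad;   -- placeholder for one of the slow views

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;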

Large database configuration

What is the best way to optimize a database with millions of records and hundreds of simultaneous queries?
The database holds 2 significant tables, in a one to many relationship (table1 has a column for the key of table2).
Indexing has been applied for the relevant columns.
Caching is not very effective because each record is read only a few times after it has been inserted or updated, and not within a known time frame.
The data can be logically arranged to be distributed between different databases without the need for cross-database query.
What is the best database engine and configuration for this table structure?
Can something be done in other layers of the application?

Insert data from a different DB server every second

The primary DB receives all the raw data every 10 minutes, but only stores it for 1 week. I would like to keep all the raw data for 1 year in another DB on a different server. How can I do that?
I have written a T-SQL query to select the required data from the primary DB. How can I keep pulling data from the primary DB and inserting it into the secondary DB? The table has a datetime column; could I insert only the new data for the latest datetime?
Notes: the source data is on SQL Server 2012,
the secondary DB is on SQL Server 2005.
If you are on SQL Server 2008 or higher, the MERGE command (see the MS docs) may be very useful in your actual update process. Be sure you understand it.
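A minimal sketch of what that could look like, assuming the rows have already been shipped to a staging table on the OLAP server; all table and column names (StagingRawData, OlapRawData, RecordId and so on) are placeholders:

-- Upsert staged rows into the year-long OLAP table.
MERGE dbo.OlapRawData AS target
USING dbo.StagingRawData AS source
    ON target.RecordId = source.RecordId
WHEN MATCHED THEN
    UPDATE SET target.ReadingValue = source.ReadingValue,
               target.ReadingTime  = source.ReadingTime
WHEN NOT MATCHED BY TARGET THEN
    INSERT (RecordId, ReadingValue, ReadingTime)
    VALUES (source.RecordId, source.ReadingValue, source.ReadingTime);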
Your table containing the full year of data sounds like it could be OLAP, so I refer to it that way occasionally (if you don't know what OLAP is, look it up sometime; it does not matter to this answer).
If you are only updating one or two tables, log shipping, replication and failover may not work well for you, especially since you are not replicating the table as-is (because of the different retention policies, if nothing else). So make sure you understand how replication and the like work before you go down that path. If these tables make up more than perhaps 50% of the total database, log-shipping-style methods might still be your best option. They work well and handle downtime for you -- you just replicate the source database to the OLAP server and then update from the duplicate database into your OLAP database.
Doing an update like this every second is an unusual requirement. However, if you create a linked server, you should be able to insert your selected rows into a staging table on the remote server and then update from it into your OLAP table(s). If you can reliably update your OLAP table(s) on the remote server within 1 second, you have a potentially useful method. If not, you may fall behind on posting data to your OLAP tables. If you can update once a minute instead, you are much less likely to fall behind on the update cycle (at the cost of being slightly less current at all times). A sketch of the cross-server insert follows.
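A minimal sketch of that cross-server insert, assuming a linked server named OLAPSRV and placeholder database, table and column names:

-- Runs on the source server; pushes recent rows to a staging table on the linked OLAP server.
DECLARE @LastShippedTime datetime = '20240101';   -- in practice, tracked by the copy job

INSERT INTO [OLAPSRV].OlapDb.dbo.StagingRawData (RecordId, ReadingValue, ReadingTime)
SELECT RecordId, ReadingValue, ReadingTime
FROM dbo.RawData
WHERE ReadingTime > @LastShippedTime;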
You may also want to consider putting AFTER triggers on the source table(s) that copy the changes into staging table(s) (still on the source database), with an identity column on the staging table plus a flag to indicate Insert, Update or Delete; then you are well positioned to ship updates for one or a few tables instead of the whole database. You don't need to re-query your source database repeatedly to determine what data needs to be transmitted: just SELECT TOP 1000 from your staging table(s) (ordered by the staging id) and move those rows to the remote staging table. A sketch of such a trigger follows.
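A hedged sketch of such a trigger; the source table dbo.RawData, its RecordId key and the staging layout are assumptions for illustration:

-- Staging table records what changed and how, in arrival order.
CREATE TABLE dbo.RawData_Staging (
    StagingId  bigint IDENTITY(1,1) PRIMARY KEY,
    RecordId   int     NOT NULL,
    ChangeType char(1) NOT NULL    -- 'I', 'U' or 'D'
);
GO
CREATE TRIGGER trg_RawData_Ship
ON dbo.RawData
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Rows only in inserted were inserted; rows in both inserted and deleted were updated.
    INSERT INTO dbo.RawData_Staging (RecordId, ChangeType)
    SELECT i.RecordId, CASE WHEN d.RecordId IS NULL THEN 'I' ELSE 'U' END
    FROM inserted i
    LEFT JOIN deleted d ON d.RecordId = i.RecordId;

    -- Rows only in deleted were deleted.
    INSERT INTO dbo.RawData_Staging (RecordId, ChangeType)
    SELECT d.RecordId, 'D'
    FROM deleted d
    LEFT JOIN inserted i ON i.RecordId = d.RecordId
    WHERE i.RecordId IS NULL;
END;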
If you fall behind, a TOP 1000 loop keeps you from trying to post too much data in any one cross-server call.
Depending on your data, you may be able to optimize storage and reduce log churn by not copying all columns to your staging table: keep just the staging id and the primary key of the source table, and treat whatever data is in the source record at the time you post it to the OLAP database as an accurate reflection of the data at the time the record was staged. Your OLAP table won't be 100% accurate at all times, but it will be accurate eventually.
I cannot overemphasize that you need to accommodate downtime in your design -- unless you can live with data loss or just wrong data. Even reliable connections are not 100% reliable.

Populating SQL Server databases and creating indexes - which is the most efficient way?

We've got a project site where we have to replicate a legacy database system into SQL Server 2008 on a nightly basis.
We are using the SQL DataWizard tool from Maestro to do the job, and because we cannot get an accurate delta every night, it was decided that we would dump the previous SQL Server database and take a fresh snapshot every night: several million rows across about 10 tables. The snapshot takes about 2 hours to run.
Now, we also need to create some custom indexes on the snapshot copy of the data, so that certain BI tools can query the data quickly.
My question is: is it more efficient to create the tables AND the indexes before the snapshot copy is run, or to create the table structures first, run the snapshot copy, and then create the indexes after the tables are populated?
Is there a performance difference between SQL Server building the indexes WHILE rows are being added versus adding all rows first and then creating the indexes on the final data set?
Just trying to work out which way will result in less database server CPU overhead.
When you perform snapshot replication, the first task is to bulk copy the data. After the data has been copied, primary and secondary indexes are added; the indexes don't exist until that second step is complete. So no, there is nothing further to be gained by applying the indexes yourself after the snapshot.
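For the custom BI indexes (which are outside the replicated schema anyway), creating them after the nightly load finishes would look roughly like this; the table, column and index names are placeholders:

-- Built once against the fully populated snapshot copy, after the copy completes.
CREATE NONCLUSTERED INDEX IX_Snapshot_BiLookup
    ON dbo.SomeSnapshotTable (BusinessKey, SnapshotDate)
    INCLUDE (Amount);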

SQL Server to PostgreSQL - Migration and design concerns

Currently migrating from SQL Server to PostgreSQL and attempting to improve a couple of key areas on the way:
I have an Articles table:
CREATE TABLE [dbo].[Articles](
[server_ref] [int] NOT NULL,
[article_ref] [int] NOT NULL,
[article_title] [varchar](400) NOT NULL,
[category_ref] [int] NOT NULL,
[size] [bigint] NOT NULL
)
Data (comma delimited text files) is dumped on the import server by ~500 (out of ~1000) servers on a daily basis.
Importing:
Indexes are disabled on the Articles table.
For each dumped text file
Data is BULK copied to a temporary table.
Temporary table is updated.
Old data for the server is dropped from the Articles table.
Temporary table data is copied to Articles table.
Temporary table dropped.
Once this process is complete for all servers the indexes are built and the new database is copied to a web server.
I am reasonably happy with this process but there is always room for improvement as I strive for a real-time (haha!) system. Is what I am doing correct? The Articles table contains ~500 million records and is expected to grow. Searching across this table is okay but could be better, e.g. SELECT * FROM Articles WHERE server_ref=33 AND article_title LIKE '%criteria%' has been satisfactory, but I want to improve the speed of searching. Obviously the leading-wildcard LIKE is my problem here. Suggestions? SELECT * FROM Articles WHERE article_title LIKE '%criteria%' is horrendous.
Partitioning is an Enterprise-only feature in SQL Server ($$$), which makes it one of the many exciting prospects of PostgreSQL. What performance hit will be incurred by the import process (drop data, insert data) and by building the indexes? Will the database grow by a huge amount?
The database currently stands at 200 GB and will grow. Copying this across the network is not ideal, but it works. I am considering changing the hardware structure of the system. The reasoning behind having an import server and a web server is that the import server can do the dirty work (WITHOUT indexes) while the web server (WITH indexes) presents reports. Maybe reducing the system down to one server would let us skip the copy-across-the-network stage. This one server would have two versions of the database: one with the indexes for delivering reports and the other without for importing new data. The databases would swap daily. Thoughts?
This is a fantastic system, and believe it or not there is some method to my madness by giving it a big shake up.
UPDATE: I am not looking for help with relational databases, but hoping to bounce ideas around with data warehouse experts.
I am not a data warehousing expert, but here are a couple of pointers.
It seems like your data can be easily partitioned. See the PostgreSQL documentation on partitioning for how to split data into different physical tables. This lets you manage the data at your natural per-server granularity.
You can use PostgreSQL's transactional DDL to avoid some copying. The process then looks something like this for each input file (a sketch of the full sequence follows below):
create a new table to store the data.
use COPY to bulk load data into the table.
create any necessary indexes and do any processing that is required.
In a transaction drop the old partition, rename the new table and add it as a partition.
If you do it like this, you can swap out the partitions on the go if you want to. Only the last step requires locking the live table, and it's a quick DDL metadata update.
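A rough sketch of the per-file sequence, assuming old-style inheritance partitioning with a parent articles table and one child table per server (the names, file path and example server_ref of 33 are placeholders; newer PostgreSQL versions can do the same with declarative partitions and ATTACH/DETACH PARTITION):

-- 1. Build and index the new table outside the live hierarchy.
CREATE TABLE articles_new_33 (LIKE articles INCLUDING DEFAULTS);
COPY articles_new_33 FROM '/import/server_33.csv' WITH (FORMAT csv);
CREATE INDEX ON articles_new_33 (article_title);

-- 2. Swap it in as the partition for server 33 in one short transaction.
BEGIN;
DROP TABLE IF EXISTS articles_33;
ALTER TABLE articles_new_33 RENAME TO articles_33;
ALTER TABLE articles_33 ADD CONSTRAINT articles_33_server_chk CHECK (server_ref = 33);
ALTER TABLE articles_33 INHERIT articles;
COMMIT;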
Avoid deleting and reloading data in an indexed table - that leads to considerable table and index bloat because of the MVCC mechanism PostgreSQL uses. If you just swap out the underlying table, you get nice compact tables and indexes. If your queries have any data locality on top of the partitioning, then either order your input data on that, or, if that's not possible, use PostgreSQL's CLUSTER functionality to reorder the data physically.
To speed up the text searches, use a GIN full-text index if the constraints are acceptable (it can only search at word boundaries), or a trigram index (supplied by the pg_trgm extension module) if you need to search for arbitrary substrings.
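A short sketch of both options (index names are arbitrary):

-- Full-text (word-boundary) search:
CREATE INDEX articles_title_fts_idx ON articles
    USING gin (to_tsvector('english', article_title));

-- Arbitrary-substring search (LIKE '%criteria%') via trigrams:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX articles_title_trgm_idx ON articles
    USING gin (article_title gin_trgm_ops);

With the trigram index in place, the existing LIKE '%criteria%' predicates can use the index directly (on PostgreSQL 9.1 or newer); the full-text index instead requires rewriting the predicate, e.g. to_tsvector('english', article_title) @@ to_tsquery('criteria').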
