Do partitions allow multiple bulk loads? - sql-server

I have a database that contains data for many "clients". Currently, we insert tens of thousands of rows into multiple tables every so often using .Net SqlBulkCopy which causes the entire tables to be locked and inaccessible for the duration of the transaction.
As most of our business processes rely upon accessing data for only one client at a time, we would like to be able to load data for one client, while updating data for another client.
To make things more fun, all PKs, FKs and clustered indexes are on GUID columns (I am looking at changing this).
I'm looking at adding the ClientID into all tables, then partitioning on this. Would this give me the functionality I require?

I haven't used the built-in partitioning functionality of SQL Server, but it's something I am particularly interested in. My understanding is that this would solve your problem.
From this article
This allows you to operate on a
partition even with performace
critical operation, such as
reindexing, without affecting the
others.
And a great whitepaper on partitioning by Kimberly L Tripp is here. Well worth a read - I won't even try to paraphrase it - covers it all in a lot of detail.
Hope this helps.

Can you partition on Client ID : Yes, but partitioning is limited to 1000 partitions so that is 1000 clients before it hits a hard limit. The only way to get around that is to start using partitioned views across multiple partitioned tables - it gets a bit messy.
Will is help your locking situation : In SQL 2005 the lock escalation is row -> page -> table, but in 2008 they introduced a new level allowing row -> page -> partition -> table. So it might get round it, depending on your SQL version (unspecified).
If 2008 is not an option, then there is a trace flag (TF 1211 / 1224) feature that turns off lock escalations, but I would not jump in and use it without some serious testing.
The partitioning feature remains an enterprise upwards feature as well which puts some people off.
The most ideal way in which to perform a data load with partitioning, but avoiding locks is to bring the data into a staging table and then swap the data into a new partition - but this requires that the data is somewhat sequence based (such as datetime) so that new data can be brought in to an entirely new partition whilst older data eventually is removed. (rolling the partition window.)

Related

SQL Server In-Memory table use case with huge data

I have SQL Server table with 160+ million records having continuous CRUD operations from UI, batch jobs etc. basically from multiple sources
Currently I have partitioned the table on a column to have better performance on the table.
I came across In-Memory tables which can be used in case of tables with frequent updates and also if updates happening from multiple sources it won't put a lock instead it will maintain row versioning, so concurrent updates is better using this approach.
So what are my options in this case ?
Partition the table or Create In-Memory table
As I have read SQL server is not supporting In-Memory table when table is partitioned.
What is the better option in this case In-Memory table or partitioned table.
It depends.
In-memory tables look great on theory, but you really need to spend time learning the details in order to make the right implementation. You may find some details disturbing. For example:
there are no parallel inserts in in-memory tables which make creation of rows slower compare to parallel insert in traditional table stored in SSD
not all index operations supported by dis-based indexes are available in in-memory table indexes
not all data types are supported
there are both unsupported features and T-SQL constructs
you may need more RAM then you think
If you are ready to pay the price for using Hekaton, you may start with reading its white-paper.
The partitioning itself comes with benefits but there is no guarantee it will heal your system. Only particular queries and case-scenarios can benefit from it. For example, if 99% of your workload is touching the data in one partition you may see no optimization at all. On the other hand, if your reports are based on historical data and your inserts/updates/deletes touch another partition it will be better.
Both of the technologies are good, but need to be examine in details and applied carefully. Often, folks believe that using some new tech will solve their problems, when the problems can be solved just applying some basic concepts.
For example, you said that you are performing CRUD over 160+ millions records. Ask yourself:
is my table normalized - when data is stored in normalized way you gain two things - first, you will perform CRUD only on part of the data, the engine may read only the data that is needed for particular query (without the need to support an index)
are my T-SQL statements write well - row by agonizing row, calling stored procedures in loops or not processing the data in batches are common sources of slow queries
which are the blocking and deadlocked queries - for example, there is a possibility one long running query to block all your inserts - identify these types of issues first and try to resolve them with data pre-calculation (indexed view) or creating covering indexes (which can be filtered with include columns, too)
are readers and writers being blocked - you can try different isolation levels to solve this type of issues - RCSI is the Azure default isolation level. You may need to add more RAM to your RAMDISK used by your TempDB, but since your are looking at Hekaton, this will be easier to test (and rollback) compare to it(or partitioning)

Database tables optimized for both read and write

We have a web service that pumps data into 3 database tables and a web application that reads that data in aggregated format in a SQL Server + ASP.Net environment.
There is so much data arriving to the database tables and so much data read from them and at such high velocity, that the system started to fail.
The tables have indexes on them, one of them is unique. One of the tables has billions of records and occupies a few hundred gigabytes of disk space; the other table is a smaller one, with only a few million records. It is emptied daily.
What options do I have to eliminate the obvious problem of simultaneously reading and writing from- and to multiple database tables?
I am interested in every optimization trick, although we have tried every trick we came across.
We don't have the option to install SQL Server Enterprise edition to be able to use partitions and in-memory-optimized tables.
Edit:
The system is used to collect fitness tracker data from tens of thousands of devices and to display data to thousands of them on their dashboard in real-time.
Way too broad of requirements and specifics to give a concrete answer. But a suggestion would be to setup a second database and do log shipping over to it. So the original db would be the "write" and the new db would be the "read" database.
Cons
Diskspace
Read db would be out of date by the length of time for log tranfser
Pro
- Could possible drop some of the indexes on "write" db, this would/could increase performance
- You could then summarize the table in the "read" database in order to increase query performance
https://msdn.microsoft.com/en-us/library/ms187103.aspx
Here's some ideas, some more complicated than others, their usefulness depending really heavily on the usage which isn't fully described in the question. Disclaimer: I am not a DBA, but I have worked with some great ones on my DB projects.
[Simple] More system memory always helps
[Simple] Use multiple files for tempdb (one filegroup, 1 file for each core on your system. Even if the query is being done entirely in memory, it can still block on the number of I/O threads)
[Simple] Transaction logs on SIMPLE over FULL recover
[Simple] Transaction logs written to separate spindle from the rest of data.
[Complicated] Split your data into separate tables yourself, then union them in your queries.
[Complicated] Try and put data which is not updated into a separate table so static data indices don't need to be rebuilt.
[Complicated] If possible, make sure you are doing append-only inserts (auto-incrementing PK/clustered index should already be doing this). Avoid updates if possible, obviously.
[Complicated] If queries don't need the absolute latest data, change read queries to use WITH NOLOCK on tables and remove row and page locks from indices. You won't get incomplete rows, but you might miss a few rows if they are being written at the same time you are reading.
[Complicated] Create separate filegroups for table data and index data. Place those filegroups on separate disk spindles if possible. SQL Server has separate I/O threads for each file so you can parallelize reads/writes to a certain extent.
Also, make sure all of your large tables are in separate filegroups, on different spindles as well.
[Complicated] Remove inserts with transactional locks
[Complicated] Use bulk-insert for data
[Complicated] Remove unnecessary indices
Prefer included columns over indexed columns if sorting isn't required on them
That's kind of a generic list of things I've done in the past on various DB projects I've worked on. Database optimizations tend to be highly specific to your situation...which is why DBA's have jobs. Some of the 'complicated' answers could be simple if your architecture supports it already.

We failed trying database per custom installation. Plan to recover?

There is a web application which is in production mode for 3 years or so by now. Historically, because of different reasons there was made a decision to use database-per customer installation.
Now we came across the fact that now deployments are very slow.
Should we ever consider moving all the databases back to single one to reduce environment complexity? Or is it too risky idea?
The problem I see now is that it's very hard to merge these databases with saving referential integrity(primary keys of different database' tables can not be obviously differentiated).
Databases are not that much big, so we don't have much benefits of reduced load by having multiple databases.
Your question is quite broad.
a) Ensure that merged databases don't suffer from degraded performance with things like JOIN statements when, say, 1000 databases are merged even though each is small. As for your referential integrity ... which I assume is auto_increment based ... you can replace these relationships by altering the schema and supplanting UUID or a similar unique, non-sequential value. Or even a surrogate key pair in addition to your auto increment PK.
b) Do benchmarking to ensure your application would respond within performance limits
c) Is there a direct ROI for doing this? What are the long term cost benefits vs the expense of migration? Is the decreased complexity worth increased (if any) cost?
d) How does this impact your backup and disaster recovery plans? Does it make them cheaper? Slower? More expensive?
Abstraction and management tools approach:
if it were me, depending on the situation, I would keep the scalability that comes with per-client sharding and create a set of management tools to abstractly create one virtual database. Using these tools you can acquire the simplified management without loosing technical flexibility. I suspect you want to simplify the cost of managing all these databases (based on your deployment statement). Creating a 'control panel' for your farm can be a good way to simplify a complex system (especially when deployments may use different schema versions).
For the migrated data... customer one database UUIDs can start with 10000000, Customer two database UUIDs can start with 20000000. Customer three database UUIDs can start with 30000000.....
In my opinion when you host the database for your customers, a single database that handles multiple customers is a better idea overall. Of course you need to add a "customers" table to record the customers, and a "customer_id" column on all top-level data that is within the table, and include checks in all your SQL to ensure the customer's view is limited to their own data.
I'd set up a new database with the additional columns, and then test it with a dummy customer or three for a while to ensure all bugs are wiped out. Then I'd migrate all the customers across, one by one, doing checks that the data will fit.

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried with the fact that one table in particular will be getting very big. Each customer has approximately 20.000.000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be done on this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advices below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code) on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently, as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indices on the temp table and run the query. This is not how it should be. Therefore, we are considering to buy Enterprise Edition for use of partitioned tables. If the purchase cannot go through I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
Datawarehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well I assume you know some things about the subject. Depending on your goals you could go for a star-schema (a multidemensional model with a fact and dimensiontables). Store all fastchanging data in 1 table (per subject) and the slowchaning data in another dimension/'snowflake' tables.
An other option is the DataVault method by Dan Lindstedt. Which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge anmout of records and SQl server should handle with ease.
A partioned single table is usually the best way to go. Trying to maintain separate indivudal customer tables is very costly in termas of time and effort and far more probne to errors.
Also examine you current queries if you are experiencing performance issues. If you don't have proper indexing (did you for instance index the foreign key fields?) queries will be slow, if you don't have sargeable queries they will be slow if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is striclty needed? If you have select * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views or if you used EAV table, you willhave performance iisues at this level. If you allowed a framework to autogenerate SQl code, you may well have badly perforimng queries. Remember Profiler is your friend. Of course you could also have a hardware issue, you need a pretty good sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you need to hire a professional dba with performance tuning experience. It is quite complex stuff. Databases desigend by application programmers often are bad performers when they get a real number of users and records. Database MUST be designed with data integrity, performance and security in mind. If you didn't do that the changes of having them are slim indeed.
Partioning is definately something to look into. I had a database that had 2 tables sharded. Each table contained around 30-35million records. I have since merged this into one large table and assigned some good indexes. So far, I've not had to partition this table as it's working a treat, but I'm keep partitioning in mind. One thing that I have noticed, compared to when the data was sharded, and that's the data import. It is now slower, but I can live with that as the Import tool can be re-written ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries to perform (but partitioning is not usually about performance but about maintenance being able to add and drop partitions quickly) over very large data sets - but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of table that you can easily 'archive off', and the aggrevation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at BTree searching than your own invention is)
You will need to look into the performance and locking issues however - this will prevent your db from scaling.
You can also create supplemental tables that hold already calculated details on historical information if there are common queries.

What can cause bad SQL server performance?

Every time I find out that the performance of data retrieval from my database is slow. I try to figure out which part of my SQL query has the problem and I try to optimize it and also add some indexes to the table. But this does not always solve the problem.
My question is :
Are there any other tricks to make SQL server performance better?
What are the other reason which can make SQL server performance worse?
Inefficient query design
Auto-growing files
Too many indexes to be maintained on a table
Too few indexes on a table
Not properly choosing your clustered index
Index fragmentation due to poor maintenance
Heap fragmentation due to no clustered index
Too high FILLFACTORs used on indexes, causing excessive page splitting
Too low of a FILLFACTOR used on indexes, causing excessive space usage and increased scanning time
Not using covered indexes where appropriate
Non-selective indexes being used
Improper maintenance of statistics (out of date statistics)
Databases not normalized properly
Transaction logs and data sharing the same drive spindles
The wrong memory configuration
Too little memory
Too little CPU
Slow hard drives
Failing hard drives or other hardware
A 3D screensaver on your database server chewing up your CPU
Sharing the database server with other processes which compete for CPU and memory
Lock contention between queries
Queries which scan entire large tables
Front end code which searches data in an inefficent manner (nested loops, row by row)
CURSORS which are not necessary and/or are not FAST_FORWARD
Not setting NOCOUNT when you have large tables being cursored through.
Using a transaction isolation level which is too high (such as using SERIALIZABLE when it's not necessary)
Too many round trips between the client and the SQL Server (a chatty interface)
An unnecessary linked server query
A linked server query which targets a table on a remote server with no primary or candidate key defined
Selecting too much data
Excessive query recompilations
oh and there might be some others, too.
When I talk to new developers that have this problem I usually find that it is because of one of two problems. Both of them are fixed if you follow these 2 rules.
First, don’t retrieve any data that you don’t need. For example, if you are doing paging then don’t bring back 100 rows and then calculate which ones belong on the page. Have the stored proc figure it out and only retrieve the 10 you need.
Second, nothing is faster than work you don’t do. For example, I worked on a system where the full roles and rights for a user were retrieved with every page requested – this was 100’s of rows for some users. Even just saving this to session state on the first request and then using it from there for subsequent requests took a meaningful weight off of the database.
Suggest you get a good book on Performance tuning for the database you use (this is very much database specific). This is an extremely complex subject and cannot really be answered other than in generalities on the web.
For instance, Dave markle tell you inefficient queries can cause the problem and there are many many ways to write inefficient queries and many more ways to fix them.
If you're new to the database and you have access to the database engine tuning advisor, you can heuristically tune your database.
You basically capture the SQL queries being run against your DB in the SQL Profiler, then feed those to DETA. DETA effectively runs the queries (without altering your data) and then works out what information your database is missing (views, indexes, partitions, statistics etc.) to do the queries better.
It can then apply them for you and monitor them in the future. I'm not saying to assume that DETA is always right or to do things without understanding, but I've found that it's definately a good way to see what your queries are doing, how long they take, and how you can index the DB appropriately.
PS: With all that said, it's much better to invest in a good DBA at the start of a project so that you have good structures and indexing to start with. But thats not the position that you're in right now...
This is a very wide question. And there is a ton of answers already. Still I would like to add one important factor - Page Split. The problem is – there are good splits and bad splits. Following are good articles explaining how to use transaction_log extended event for identifying bad/nasty page splits
Tracking Problematic Pages Splits in SQL Server 2012 Extended Events - Jonathan Kehayias
Tracking page splits using the transaction log - Paul Randal
You mentioned:
I try to optimize it and also add some indexes
But, sometimes removing unused non-clustered indexes may help to improve performance as it help to reduce transaction logs. Read Top Reasons for Log Performance Problems
Wait statistics, or please tell me where it hurts gives an idea about using wait statistics for performance analysis.
To see some fresh ideas for performance, take a look at
Performance Considerations - sqlmag.com
Separate tables in joins to different disks (for parallel disk I/O - filegroups).
Avoid joins on columns with few unique values.
To understand JOIN, read Advanced JOIN Techniques

Resources