Database design: one huge table or separate tables? - sql-server

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried about the fact that one table in particular will be getting very big. Each customer has approximately 20,000,000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be done on this table. I am already noticing performance issues and users being temporarily locked out.
My question is: will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advice below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code), on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently; as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indices on the temp table and run the query. This is not how it should be. Therefore, we are considering buying Enterprise Edition for the use of partitioned tables. If the purchase cannot go through, I plan to use a workaround to accomplish partitioning in Standard Edition.

Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
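Purely as an illustration of what that could look like (all object and column names below are invented for the example, and partitioned tables require Enterprise Edition), a partition function and scheme keyed on the customer might be set up roughly like this:

    -- Hypothetical sketch: partition a large fact table by customer id.
    CREATE PARTITION FUNCTION pfCustomer (int)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5);      -- one boundary per customer

    CREATE PARTITION SCHEME psCustomer
    AS PARTITION pfCustomer ALL TO ([PRIMARY]);     -- or map partitions to separate filegroups

    CREATE TABLE dbo.FactBenchmark
    (
        CustomerId    int           NOT NULL,
        DepartmentId  int           NOT NULL,
        MeasureDate   date          NOT NULL,
        MeasureValue  decimal(18,4) NOT NULL
    ) ON psCustomer (CustomerId);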

Data warehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a data warehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second (or indeed sub-minute) response for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.

Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.

I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.

Since you've tagged your question as 'datawarehouse' as well, I assume you know some things about the subject. Depending on your goals you could go for a star schema (a multidimensional model with a fact table and dimension tables). Store all fast-changing data in one table (per subject) and the slow-changing data in separate dimension/'snowflake' tables.
Another option is the Data Vault method by Dan Linstedt, which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/

In a properly designed database, that is not a huge amount of records and SQL Server should handle it with ease.
A partitioned single table is usually the best way to go. Trying to maintain separate individual customer tables is very costly in terms of time and effort and far more prone to errors.
Also examine your current queries if you are experiencing performance issues. If you don't have proper indexing (did you, for instance, index the foreign key fields?), queries will be slow; if you don't have sargable queries, they will be slow; if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is strictly needed? If you have select * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views, or if you used an EAV table, you will have performance issues at this level. If you allowed a framework to autogenerate SQL code, you may well have badly performing queries. Remember, Profiler is your friend. Of course you could also have a hardware issue; you need a pretty good sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you hire a professional DBA with performance tuning experience. It is quite complex stuff. Databases designed by application programmers are often bad performers when they get a real number of users and records. Databases MUST be designed with data integrity, performance and security in mind. If you didn't do that, the chances of having them are slim indeed.
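As a small, hypothetical illustration of the indexing and sargability points above (table and column names are invented):

    -- Index the foreign key / filter columns so joins and lookups can seek.
    CREATE NONCLUSTERED INDEX IX_FactBenchmark_HospitalCode
        ON dbo.FactBenchmark (HospitalCode, MeasureDate)
        INCLUDE (MeasureValue);

    -- Sargable: the predicate leaves the column alone, so the index can be used.
    SELECT HospitalCode, SUM(MeasureValue)
    FROM dbo.FactBenchmark
    WHERE MeasureDate >= '20090101' AND MeasureDate < '20100101'
    GROUP BY HospitalCode;

    -- Not sargable: wrapping the column in a function forces a scan.
    -- WHERE YEAR(MeasureDate) = 2009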

Partitioning is definitely something to look into. I had a database with two tables sharded. Each table contained around 30-35 million records. I have since merged these into one large table and assigned some good indexes. So far, I've not had to partition this table as it's working a treat, but I'm keeping partitioning in mind. One thing that I have noticed, compared to when the data was sharded, is the data import. It is now slower, but I can live with that as the import tool can be re-written ;o)

One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
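For reference, the hint being warned about looks like this (the table name is illustrative); it avoids blocking by reading uncommitted data, which is exactly why the results can be wrong:

    -- NOLOCK reads uncommitted rows: results may include data that is later rolled back,
    -- and rows can be missed or counted twice during concurrent modifications.
    SELECT HospitalCode, COUNT(*)
    FROM dbo.FactBenchmark WITH (NOLOCK)
    GROUP BY HospitalCode;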

Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries over very large data sets to perform (though partitioning is usually less about performance than about maintenance, being able to add and drop partitions quickly) - but it really depends on the order of aggregation and the types of queries.

One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.

If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.

Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of table that you can easily 'archive off', and the aggravation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at B-tree searching than your own invention is).
You will need to look into the performance and locking issues, however - these will otherwise prevent your db from scaling.

You can also create supplemental tables that hold pre-calculated summaries of the historical information, if there are common queries.
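A rough sketch of such a supplemental table, assuming a hypothetical detail table and a monthly aggregate that common reports read instead of the detail rows (refreshed by a scheduled job; all names are invented):

    -- Hypothetical pre-aggregated summary table for common historical queries.
    CREATE TABLE dbo.MonthlyBenchmarkSummary
    (
        HospitalCode   int           NOT NULL,
        DepartmentCode int           NOT NULL,
        MonthStart     datetime      NOT NULL,
        RowCnt         int           NOT NULL,
        TotalValue     decimal(18,4) NOT NULL,
        CONSTRAINT PK_MonthlyBenchmarkSummary
            PRIMARY KEY (HospitalCode, DepartmentCode, MonthStart)
    );

    INSERT INTO dbo.MonthlyBenchmarkSummary
    SELECT HospitalCode,
           DepartmentCode,
           DATEADD(MONTH, DATEDIFF(MONTH, 0, MeasureDate), 0),   -- first day of the month
           COUNT(*),
           SUM(MeasureValue)
    FROM dbo.FactBenchmark
    GROUP BY HospitalCode, DepartmentCode,
             DATEADD(MONTH, DATEDIFF(MONTH, 0, MeasureDate), 0);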

Related

One to One, One to Many, and Many to Many Relationship Performance calculating

I have created a database design for a current project that I am working on. I am new to databases, so I read the book "Database Design for Mere Mortals" for help, and it helped me create my first database design. My question is: how will I be able to estimate the performance of all the relationships (One to One, One to Many, and Many to Many)? Thanks for the help!
There are in principle no performance problems with any relationships in a relational database. Performance problems come when you query the data. For that reason, you need to know what types of queries are going to be run. Once you know that (and the volumes of data in the affected tables), you can create indexes which will help the queries.
When you find a badly performing query, you use what tools you have to analyse the query and show its execution plan. If you're lucky, some tools (SSMS in SQL Server) can give very good hints about what indexes would help but don't exist. If you don't have tools that can do that, you have to study the plan and work out why, for example, it is reading every row in a 10 million row table rather than just the 5 you get at the end. (The latter would be due to a missing index on the columns you use to restrict the rows returned, either in JOIN clauses or WHERE clauses or both.)
Indexes are what make a DB perform, given a reasonable design in the first place.
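As a hedged example (names are invented): if the plan shows a full scan of a large table with the filter applied afterwards, the usual fix is an index on the restricting columns so the plan becomes a seek:

    -- Query that was scanning all 10 million rows:
    SELECT o.OrderId, o.OrderDate, o.Total
    FROM dbo.Orders AS o
    WHERE o.CustomerId = 42
      AND o.OrderDate >= '20100101';

    -- Index on the columns used in the WHERE/JOIN clauses:
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_OrderDate
        ON dbo.Orders (CustomerId, OrderDate)
        INCLUDE (Total);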

What are the problems with a join between two tables in two different databases?

I am interested in your thoughts about the pitfalls of joining two or more tables from different databases. I'll try to give an example.
Suppose table Table1 is located in database DatabaseA and Table2 is located in DatabaseB.
Let's say I have a view in DatabaseA that pulls out some data from Table1 and some other tables in DatabaseA.
This view is used to push data to another database; let's call this one, unimaginatively, DatabaseC.
If I need some data from Table2, my instinct is to join Table2 directly in this view, sort of like this: table1 inner join DatabaseB..table2 on [some columns]
Doing this is pretty simple and quick, but I have a nagging voice in my head that keeps telling me not to do this. My worries are about not being able to track down all the objects depending on Table2, so if I change something there, I have to be very careful and remember everywhere I use this table. So it is sort of like breaking SRP for this view (and two databases), because this view can change from two different actions (performed on two different databases: changing Table1 or changing Table2).
I am interested in your opinions. Is this a good or bad idea? What would be the problems with this approach (performance wise, maintenance wise and so on), and do you have real world experience where this approach was either a big mistake or a life saver for you?
P.S: I've searched this topic on google and SO, but could not find anything related to this. I will gladly take the minus votes, duplicate questions and other 'reprimands' from SO users just to have a different view on this problem.
P.P.S: I am using SQL Server 2005.
Thank you, and I hope I made myself clear :)
If they are on the same server, there is no real problem pulling from separate databases. In fact, you may want to separate them for good reasons. For instance, if you have a combination of transactional tables and lookup tables that are imported from files: the transactional data needs full recovery and frequent transaction log backups to be able to properly restore, while the lookup data does not and can benefit from being in a database in simple recovery mode.
We have many different databases our applications use, and we cross databases in queries all the time. As long as the indexing is done properly, there has been no noticeable performance difference. The biggest potential issue is data integrity, as you can't set up foreign keys across databases. This can be handled in triggers if need be, though.
Now when the databases are on different servers, there can be a performance problem and getting the data is more complicated.
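A quick, hypothetical illustration of both points, using three-part names for the join and a trigger as a stand-in for the foreign key you cannot declare across databases (all names are taken from the example in the question or invented):

    -- Cross-database join on the same server via three-part names:
    SELECT t1.SomeKey, t2.SomeValue
    FROM DatabaseA.dbo.Table1 AS t1
    INNER JOIN DatabaseB.dbo.Table2 AS t2
        ON t2.SomeKey = t1.SomeKey;

    -- Referential integrity has to be hand-rolled, e.g. with a trigger:
    USE DatabaseA;
    GO
    CREATE TRIGGER dbo.trg_Table1_CheckTable2
    ON dbo.Table1
    AFTER INSERT, UPDATE
    AS
    BEGIN
        IF EXISTS (SELECT 1
                   FROM inserted AS i
                   WHERE NOT EXISTS (SELECT 1
                                     FROM DatabaseB.dbo.Table2 AS t2
                                     WHERE t2.SomeKey = i.SomeKey))
        BEGIN
            RAISERROR('SomeKey not found in DatabaseB.dbo.Table2', 16, 1);
            ROLLBACK TRANSACTION;
        END
    END;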
Like everything else in SQL, it depends.
At my job, we do this a LOT. We have very large data sets, and separate DBs for header and detail level records, then additional DBs for reports or tables that we build off of other data, etc etc.
There's not really a performance issue from joining across DBs, and in some cases depending on your hardware setup it may be FASTER. If DatabaseA and DatabaseB are on separate physical drives with different controllers, it will likely be faster to run a query joining those than if they were in the same DB on the same volume.
Maintenance can be an issue but no more than for any other database/tables. It's not like you have different versions of the same tables, you just have those tables in different DBs.
The only major drawback is that SQL Server does a poor job of showing cross-database dependencies, so you will need to keep track of these yourself. There are some scripts for this and also third party utilities, and I have heard that SQL Server Denali will add additional support for this, but I'm not sure if that's accurate.
Your nagging voice is probably right.
Not least of the problems will be how to enforce declarative referential integrity since you cannot create foreign keys between databases, therefore sooner or later you will have to cope with inconsistent or mismatched or incomplete data.
But if you don't care about that, I don't see a problem :-)
Some general themes re cross-database joins:
Foreign keys
As others have pointed out, in the absence of foreign keys, you'll need to roll your own referential integrity. Not a problem in itself, but issues can surface when you're not in control of the data in one or more of the databases.
A related issue is the use of CASE tools. When reverse-engineering a schema, they will overlook links between tables where a FK->PK relationship doesn't exist.
Performance
If the databases are on different servers then you're exposed to the vagaries of whatever else is running on those servers, as well as the cost of running the join operation itself. Again, if the servers are all within your control this is something you can monitor, but this may not be the case.
Coupling
If your solution relies on other databases you have multiple points of failure. If a database goes down, this could cascade to one or more systems.
Data modification
Your solution may be coupled to what you believe to be static data in tables on another database. However, what if this were accidentally (or purposefully) amended, duplicated or deleted? Again, if the databases in question are out of your remit, other teams/departments may not be aware of how your system operates.
All this being true, there are many cases where cross-database joins are the norm. A few examples I've seen:
Mart-Repository
Performant operations take place on the mart whilst the master data stash is kept on the repository. CRUD operations take place between the two on a frequent or infrequent basis (nightly update, real-time etc).
Legacy DB
You might expose a legacy database for data migration and or reporting/auditing purposes.
Lookup
One or more of your databases may contain static lookup information which can be re-used.
So to answer your question - it depends on what exactly you're doing and whether the risk is acceptable. Other solutions exist such as replication but again, how feasible this is will depend on the structure of your department/company.
The answer to your questions is...it depends.
I have noticed that there is no serious degradation in performance when you keep the queries nice and simple (fewer joins, etc.).
The more complex the queries, the more chance that the optimizer will produce a suboptimal execution plan.
The optimizer ultimately gets to decide how to execute the query. The more complex the query, the more opportunity for the optimizer to get the order of operations "wrong".
I recently experimented with this problem...
I ran a query with roughly 8 joins on a single database. I then put up a copy of that database on the same server with a different name, and then I modified the query so that it would join to a couple tables in the second copy of the database.
As a single-database query, it ran in under 3 seconds, which was expected given the volume of data.
The cross-database join query ran in just under 3 minutes.

Help figuring out approaches to (near) real time multi dimensional data querying

I have a system that involves numerous related tables. Think of a standard category/product/order/customer/orderitem scenario. Some tables are self referencing (like Categories). None of the tables are particularly large (around 100k rows with an estimated scale to around 1 million rows). There are a lot of dimensions to this data I need to consider, but must be queried in a near real time way. I also don't know which dimensions a particular user is interested in- it can be one or many criteria across numerous tables. Things can range from
Give me everything with a category of Jackets
Give me everything with a category of Jackets->Parkas having a red color purchased in the last month in New York
Give me everything which wasn't purchased in New York which costs over $100.
Currently we have a very long SP which uses a "cascading data" approach: we go table by table, filtering everything into a temp table using whatever criteria were specified for that table. For the next table, we join the current temp table to whatever table we're using and apply a new filter set into a new temp table. It works, but manageability is poor and performance is slow. I need something better.
I need a new approach to this problem. It's clearly a need for OLAP, possibly using a star schema. Does this work in real time? Can it be configured to work in real time? Should I use indexed views to create a set of denormalized tables? Should I offload this outside of the database completely?
FYI We're using Sql Server.
As you say, this is perfect for OLAP.
With Sql Server 2005 and 2008 you can set up an almost real time solution. You should:
Create a denormalized star schema
Build an OLAP cube using that schema
Enable proactive caching to update the cube when the underlying data source changes.
It's not a trivial job, and you need the Enterprise version of SQL Server to use proactive caching. You also need some front-end tool (maybe Excel would do) to consume the cube.
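As a minimal, hypothetical sketch of the first step (the denormalized star schema the cube would sit on), using the category/product example from the question:

    -- Dimension tables: denormalized, slowly changing descriptive data.
    CREATE TABLE dbo.DimProduct
    (
        ProductKey   int IDENTITY(1,1) PRIMARY KEY,
        ProductName  nvarchar(100) NOT NULL,
        CategoryName nvarchar(100) NOT NULL,      -- category hierarchy flattened in
        Color        nvarchar(30)  NULL
    );

    CREATE TABLE dbo.DimDate
    (
        DateKey  int PRIMARY KEY,                 -- e.g. 20100131
        FullDate datetime NOT NULL,
        [Month]  int NOT NULL,
        [Year]   int NOT NULL
    );

    -- Fact table: one row per order item, keyed to the dimensions.
    CREATE TABLE dbo.FactOrderItem
    (
        DateKey    int NOT NULL REFERENCES dbo.DimDate (DateKey),
        ProductKey int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        Quantity   int NOT NULL,
        Amount     decimal(18,2) NOT NULL
    );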
It would probably be better to build a dynamic query in your code with all the joins you need, customized to each individual request (properly parameterized for security, of course).
You would use much of the same cascading logic you have now, but you move it to the code instead of the database. Then you only submit the exact query you need.
The performance would beat using all of the temp tables and you might get some caching benefit after a few queries were run.
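A rough sketch of that approach, assuming sp_executesql with filters appended only when the request supplies them (tables, columns and variables are invented for the example):

    -- Assumed inputs gathered from the user's request:
    DECLARE @categoryFilter nvarchar(50), @minPriceFilter money;
    SET @categoryFilter = N'Jackets';
    SET @minPriceFilter = 100;

    DECLARE @sql nvarchar(max), @params nvarchar(500);
    SET @sql = N'SELECT p.ProductId, p.Name
                 FROM dbo.Products AS p
                 JOIN dbo.Categories AS c ON c.CategoryId = p.CategoryId
                 WHERE 1 = 1';
    SET @params = N'@category nvarchar(50), @minPrice money';

    -- Append only the criteria that were actually specified.
    IF @categoryFilter IS NOT NULL
        SET @sql = @sql + N' AND c.Name = @category';
    IF @minPriceFilter IS NOT NULL
        SET @sql = @sql + N' AND p.Price > @minPrice';

    EXEC sp_executesql @sql, @params,
         @category = @categoryFilter,
         @minPrice = @minPriceFilter;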
Your dilemma sounds to me like "Is it better to achieve the same result by performing complex processing every time I need it, or should I do it once only for each new piece of data?".

What's the best way to manage a large number of tables in MS SQL Server?

This question is related to another:
Will having multiple filegroups help speed up my database?
The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.
Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this provides us enormous performance improvements over referring back to the source data.
The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places a pretty enormous theoretical limit on the size of sysobjects (~2 billion), but once our database grows beyond 100,000 or so, simple queries like CREATE TABLE and DROP TABLE can slow down dramatically.
We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?
Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, disk speed). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.
Without first seeing the entire system, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key - a dimensional model may also be relevant here. This table can be partitioned for improvement, which will also allow you to spread the table into other filegroups.
Another possibility is to put each run in its own database and then detach them, only attaching them as needed (and in read-only form).
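A sketch of the detach/attach idea (database names and file paths below are hypothetical):

    -- Archive a completed run database: make it read-only, then detach it.
    ALTER DATABASE Run_2009_042 SET READ_ONLY WITH ROLLBACK IMMEDIATE;
    EXEC sp_detach_db @dbname = N'Run_2009_042';

    -- Later, re-attach it only for as long as it is needed:
    CREATE DATABASE Run_2009_042
        ON (FILENAME = N'D:\Runs\Run_2009_042.mdf'),
           (FILENAME = N'D:\Runs\Run_2009_042_log.ldf')
        FOR ATTACH;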
CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.
I also recommend talking to Microsoft about your choice of database design.
Are the tables all different structures? If they are the same structure you might get away with a single partitioned table.
If they are different structures, but just subsets of the same set of dimension columns, you could still store them in partitions in the same table with nulls in the non-applicable columns.
If this is analytic (derivative pricing computations perhaps?) you could dump the results of a computation run to flat files and reuse your computations by loading from the flat files.
This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)
You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that are not evident in your post. For example, how long do you plan to keep the run analysis tables? There are a LOT of other questions that need to be asked.
You are going to need a combination of serious data warehousing, and data/table partitioning. Depending on how much data you want to keep and archive you may need to start de-normalizing and flattening the tables.
This would be pretty good case where contacting Microsoft directly can be mutually beneficial. Microsoft gets a good case to show other customers, and you get help directly from the vendor.
We ended up splitting our database into multiple databases. So the main database contains a "databases" table that refers to one or more "run" databases, each of which contains distinct sets of analysis results. Then the main "run" table contains a database ID, and the code that retrieves a saved result includes the relevant database prefix on all queries.
This approach allows the system catalog of each database to be more reasonable, it provides better separation between the core/permanent tables and the dynamic/run tables, and it also makes backups and archiving more manageable. It also allows us to split our data across multiple physical disks, although using multiple filegroups would have done that too. Overall, it's working well for us now given our current requirements, and based on expected growth we think it will scale well for us too.
We've also noticed that SQL 2008 tends to handle large system catalogs better than SQL 2000 and SQL 2005 did. (We hadn't upgraded to 2008 when I posted this question.)

Will having multiple filegroups help speed up my database?

Currently, I am developing a product that does fairly intensive calculations using MS SQL Server 2005. At a high level, the architecture of my product is based on the concept of "runs" where each time I do some analytics it gets stored in a series of run tables (~100 tables per run).
The problem I'm having is that when the number of runs grows to be about 1,000 or so after a few months, performance on the database really seems to drop off, and specifically simple queries like checking for the existence of tables or creating views can take up to a second or two.
I've heard that using multiple filegroups, which I'm not currently doing, could help. Is this true, and if so, why/how would that help? Also, if there are other suggestions, even ones like, use fewer tables, I'm open to them. I just want to speed the database up and hopefully get it in a state where it will scale.
In terms of performance, the big gain in using separate files/filegroups is that it lets you spread your data across multiple physical disks. This is beneficial because with several disks, multiple data requests can be handled simultaneously (parallel is generally faster than serial). All other things being equal, this would tend to benefit performance, but the question of how much depends on your particular data set and the queries you're running.
From your description, the slow operations you're concerned about are creating tables and checking for the existence of tables. If you are generating 100 tables per run, then after 1000 runs you have 100,000 tables. I don't have much experience with creating that many tables in a single database, but you may be pressing the limits of the system tables that track the database schema. In this case, you might see some benefit by spreading your tables across more than one database (these databases could still all live within the same instance of SQL Server).
In general, the SQL Profiler tool is the best starting point for finding slow queries. There are data columns which indicate the CPU and IO cost of each SQL batch, which should point you to the worst offenders. Once you have found the problem queries, I would use the Query Analyzer to generate query plans for each of these queries, and see if you can tell what's making them slow. Do this by opening a query window, entering your query, and hitting Ctrl+L. A complete discussion of what might be slow would fill an entire book, but good things to look for are table scans (very slow for large tables) and inefficient joins.
In the end, you may be able to improve things simply by rewriting your queries, or you may have to make more broad changes to the table schema. For instance, maybe there's a way to create only one or a few tables per run, instead of 1000. More specifics about your particular setup would help us give a more detailed answer.
I also recommend this website for lots of tips on how to make things faster:
http://www.sql-server-performance.com/
When you talk about 100 tables per run, do you actually mean that you're creating new SQL tables? If so, I think that the architecture of your application may be the issue. I can't imagine a situation where you would need that many new tables as opposed to reusing the same few tables multiple times and simply adding a column or two to differentiate between runs.
If you're already reusing the same group of tables and new runs just mean additional rows in those tables, then the issue could simply be that the new data over time is hurting performance in one of several ways. For example:
The tables/indexes could be fragmented after a while. Make sure that all of your tables have a clustered index. Check for fragmentation using sys.dm_db_index_physical_stats and issue ALTER INDEX with the REBUILD option if needed to defrag them (a sketch follows below).
The tables could simply be too large, so that inefficiencies that were invisible on small tables are now obvious on the larger tables. Look into proper indexes on the tables to improve performance.
SQL Server will cache query plans (especially for stored procedures), but if the data in a table changes significantly over time that query plan may no longer be appropriate. Look into sp_recompile for your stored procedures to see if that's needed.
#2 is the culprit that I see most often in real world situations. Developers tend to develop using only a small set of test data and overlook proper indexing because you can do almost anything with a table of 20 rows and it will look fast.
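Going back to the first point, a sketch of the fragmentation check (index and table names are illustrative):

    -- List heavily fragmented indexes in the current database.
    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name AS index_name,
           ips.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
        ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE ips.avg_fragmentation_in_percent > 30;

    -- Rebuild one that needs it:
    ALTER INDEX IX_Runs_RunDate ON dbo.Runs REBUILD;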
Hope this helps
About 1000 of what? Single row writes? Multiple row transactions? Deletes?
A general tip would be to place the data files and log files on separate physical drives. SQL Server keeps track of every write to the log, so having those on different drives should generally give you better performance.
But SQL Server tuning depends on what the application is actually doing. There are general tips but you have to measure your own thing...
Having the filegroups on different physical drives is what will give you the biggest performance boost; you can also split up where the indexes are housed so that table writes and index accesses are hitting different disks. There's a lot you can do with partitioning, but that general concept is where the biggest speed impact comes from.
It can help with performance by moving certain tables/elements onto distinct file areas/portions of the disk. This can reduce, to a certain extent, the amount of external fragmentation impacting the database.
I would also look at other factors, such as a SQL trace, to determine why queries etc. are slowing down - there can be other factors such as query statistics, SP recompiles etc. that are easier to fix and can give you greater gains in performance.
Split the tables across separate physical drives. If you have that much disk IO, you need a decent IO solution. Raid 10, fast disks, split the logs and DBs onto separate drives.
Re-examine your architecture - can you use multiple databases? If you create 1000s of tables in a go, you will soon hit some interesting bottlenecks that I've not had to deal with before. Multiple DBs should solve that. Think about having one "Controlling" db containing all your main meta-data, and then satellite DBs containing the actual data.
You don't mention any specs about your server - but we saw a decent increase in performance when we went from 8GB to 20GB RAM.
It could if you place them on separate drives - not logical but physical drives so IO is not slowing you down so much.
