SQL Server 2005 - Number of rows (recommended) in a table - sql-server

Recently we started to get some performance issues on our SQL Server.
On analysis I found the DBA has 800 million rows in ONE TABLE (300 GB in size).
No partitioning and no proper indexes - which has led to performance going downhill.
ADVICE:
How many rows would one recommend for a table in SQL Server 2005?

There is no "recommended" number.
You should only hold data that you use. If you don't use it, archive it.
If you do need it and you have performance problems, your DBA should be able to tune the DB. With that number of rows (not unusual), indexing and ensuring the SAN is working properly should do the trick. Horizontal scaling is another option.
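As a rough illustration of the archiving idea, a batched move along these lines works; the table, column and cut-off here are made up for the example, and it assumes the archive table already exists with the same columns (and no identity column of its own):

-- Move rows older than two years to an archive table in small batches
WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM dbo.BigTable
    OUTPUT DELETED.* INTO dbo.BigTable_Archive
    WHERE CreatedDate < DATEADD(YEAR, -2, GETDATE());

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to archive
END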

Oracle user here (never used MS SQL Server with such a large number of rows).
I can say that in all the systems I've worked with, all the tables having hundreds of millions of rows just had to be partitioned.
According to this document you should have such a big table partitioned in MS SQL as well: http://msdn.microsoft.com/en-us/library/ms345146(v=sql.90).aspx
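For reference, partitioning in SQL Server 2005 boils down to a partition function, a partition scheme, and a table created on that scheme. A minimal sketch, with made-up names and boundary dates (in production each partition would normally map to its own filegroup):

-- Monthly range partitioning (hypothetical names and dates)
CREATE PARTITION FUNCTION pfMonthly (datetime)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-02-01', '2009-03-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);   -- all partitions on PRIMARY just for the example

CREATE TABLE dbo.BigTable
(
    EventDate datetime     NOT NULL,
    Id        bigint       NOT NULL,
    Payload   varchar(100) NULL
) ON psMonthly (EventDate);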

There should be no real limit on the number of rows in a single table as long as it is properly indexed - 800 million doesn't strike me as that many.
What counts as "properly indexed" will depend completely on the application and the table.

I see a lot of "no limit" answers, but I'm going to disagree. Unless you have megabucks' worth of hardware, this table should be partitioned. The fact that there are 800 million rows tells me that either a.) this is a fact table in the data warehouse (and should be partitioned anyway) OR b.) the DBA has been asleep at the wheel.
I'm thinking b (or maybe collectively a and b). I can't imagine being the dba and letting a table get to 800 million records without some sort of intervention. I like to be proactive and this is a big red flag that the dba has no plan for aging the data. It's either growing very quickly or is completely unmanaged.

Related

SQL Server - Inserting new data worsens query performance

We have a 4-5 TB SQL Server database. The largest table is around 800 GB and contains 100 million rows. 4-5 other comparable tables are 1/3-2/3 of this size. We went through a process of creating new indexes to optimize performance. While the performance certainly improved, we saw that the newly inserted data was the slowest to query.
It's a financial reporting application with a BI tool working on top of the database. The data is loaded overnight, continuing into the late morning, though the majority of the data is loaded by 7am. Users start to query data around 8am through the BI tool and are most concerned with the latest (daily) data.
I wanted to know if newly inserted data causes indexes to go out of order. Is there anything we can do so that we get better performance on the newly inserted data than on the old data? I hope I have explained the issue well here. Let me know in case of any missing information. Thanks
Edit 1
Let me describe the architecture a bit.
I have a base table (let's call it Base) with Date, Id as the clustered index.
It has around 50 columns.
Then we have 5 derived tables (Derived1, Derived2, ...), split by metric type, which also have Date, Id as the clustered index and a foreign key constraint on the Base table.
Tables Derived1 and Derived2 have 350+ columns. Derived3, 4 and 5 have around 100-200 columns. There is one large view created to join all the data tables, due to limitations of the BI tool. Date, Id are the joining columns for all the tables forming the view (hence I created the clustered index on those columns). The main concern is BI tool performance. The BI tool always uses the view and generally sends similar queries to the server.
There are other indexes as well on other filtering columns.
The main question remains - how to prevent performance from deteriorating.
In addition I would like to know
If an NCI on Date, Id on all tables would be a better bet in addition to the clustered index on Date, Id.
Does it make sense to have 150 columns as included columns in the NCI for the derived tables?
You have about 100 million rows, increasing every day with new portions, and those new portions are usually the ones selected. With those numbers I would use partitioned indexes, not regular indexes.
Your solution within SQL Server would be partitioning. Take a look at SQL partitioning and see if you can adopt it. Partitioning is a form of clustering where groups of data share a physical block. If you partition by year and month, for example, all 2018-09 records share the same physical space and are easy to find, so if you select records with those filters (plus more) it is as if the table only had the size of the 2018-09 records. That is not exactly accurate, but it is quite close. Be careful with the data values you partition on: unlike standard PK clusters, where each value is unique, the partitioning column(s) should produce a manageable set of distinct combinations, and thus partitions.
If you cannot use partitions, you have to create 'partitions' yourself using regular indexes. This will require some experimentation. The basic idea is a value (a number, say) indicating a wave, or set of waves, of imported data. For example, data imported today and over the next 10 days would be wave '1', the next 10 days '2', and so on. By filtering on the latest 10 waves you work only on the latest 100 days of imports and effectively skip all the rest of the data. Roughly, if you divided your existing 100 million rows into 100 waves, started at wave 101, and searched for waves 90 or greater, you would have about 10 million rows to search, provided SQL Server is persuaded to use the new index first (it will eventually).
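A minimal sketch of the 'wave' idea; the column, index and wave numbers are hypothetical, and the column would be populated by the load process:

-- Tag each import batch with a wave number and index it
ALTER TABLE dbo.Base ADD ImportWave int NULL;   -- populated by the nightly load

CREATE NONCLUSTERED INDEX IX_Base_ImportWave
    ON dbo.Base (ImportWave)
    INCLUDE ([Date], Id);

-- Queries against recent data then only touch the latest waves
SELECT COUNT(*)
FROM dbo.Base
WHERE ImportWave >= 90;   -- roughly the 10 most recent waves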
This is a broad question, especially without knowing your system. But one thing I would try is manually updating your stats on the indexes/table once you are done loading data. With tables that big, it is unlikely that you will manipulate enough rows to trigger an auto-update, and without fresh stats SQL Server won't have an accurate histogram of your data.
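For example, a post-load step along these lines (table and index names are placeholders):

-- Run after the nightly load completes so the optimizer sees the new date range
UPDATE STATISTICS dbo.Base WITH FULLSCAN;       -- or SAMPLE 20 PERCENT if FULLSCAN takes too long
UPDATE STATISTICS dbo.Derived1 WITH FULLSCAN;
-- A single statistic/index can also be refreshed on its own:
-- UPDATE STATISTICS dbo.Base IX_Base_Date WITH FULLSCAN;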
Next, dive into your execution plans and see what operators are the most expensive.

Handling large datasets with SQL Server

I'm looking to manage a large dataset of log files. There is an average of 1.5 million new events per month that I'm trying to keep. I've used Access in the past, though it's clearly not meant for this, and managing the dataset is a nightmare because I'm having to split it into months.
For the most part, I just need to filter event types and count the number of events. But before I do a bunch of work on the data import side of things, I wanted to see if anyone can verify that SQL Server is a good choice for this. Is there an entry limit I should stay under and archive entries beyond? Is there a way of archiving entries?
The other part is that I'm entering logs from multiple sources. With this number of entries, is it wise to put them all into the same table, or should each source have its own table to make queries faster?
edit...
There would be no joins, and about 10 columns. Data would be filtered through a view, and I'm interested to know whether a select query that filters on one or more columns would have a reasonable response time. Does creating a set of views speed things up for frequent queries?
In my experience, SQL Server is a fine choice for this, and you can definitely expect better performance from SQL Server than MS-Access, with generally more optimization methods at your disposal.
I would probably go ahead and put this stuff into SQL Server Express as you've said, hopefully installed on the best machine you can use (though you did mention only 2GB of RAM). Use one table so long as it only represents one thing (I would think a pilot's flight log and a software error log wouldn't be in the same "log" table, as an absurdly contrived example). Check your performance. If it's an issue, move forward with any number of optimization techniques available to your edition of SQL Server.
Here's how I would probably do it initially:
Create your table with a non-clustered primary key, if you use a PK on your log table -- I normally use an identity column to give me a guaranteed order of events (unlike duplicate datetimes) and to show possible log insert failures (missing identities). Set a clustered index on the main datetime column (you mentioned that you're already splitting into separate tables by month, so I assume you'll query this way, too). If you have a few queries that you run on this table routinely, by all means make views of them, but don't expect a speedup from simply doing so. You'll more than likely want to look at indexing your table based upon the WHERE clauses in those queries. This is where you'll be giving SQL Server the information it needs to run those queries efficiently.
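A minimal table definition along those lines; all names and column sizes are illustrative only:

-- Log table: identity PK kept non-clustered, clustered index on the datetime column
CREATE TABLE dbo.EventLog
(
    EventLogId int IDENTITY(1,1) NOT NULL,
    EventTime  datetime     NOT NULL,
    EventType  varchar(50)  NOT NULL,
    Source     varchar(50)  NOT NULL,
    Message    varchar(400) NULL,
    CONSTRAINT PK_EventLog PRIMARY KEY NONCLUSTERED (EventLogId)
);

CREATE CLUSTERED INDEX CIX_EventLog_EventTime ON dbo.EventLog (EventTime);

-- Supports the common "filter by event type and count" query
CREATE NONCLUSTERED INDEX IX_EventLog_EventType ON dbo.EventLog (EventType, EventTime);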
If you're unable to get your desired performance through optimizing your queries, indexes, using the smallest possible datatypes (especially on your indexed columns) and running on decent hardware, it may be time to try partitioned views (which require some form of ongoing maintenance) or partitioning your table. Unfortunately, SQL Server Express may limit you on what you can do with partitioning, and you'll have to decide if you need to move to a more feature-filled edition of SQL Server. You could always test partitioning with the Enterprise evaluation or Developer editions.
Update:
For the most part, I just need to filter event types and count the number of events.
Since past logs don't change (sort of like past sales data), storing the past aggregate numbers is an often-used strategy in this scenario. You can create a table which simply stores your counts for each month and insert new counts once a month (or week, day, etc.) with a scheduled job of some sort. Using the clustered index on your datetime column, SQL Server could much more easily aggregate the current month's numbers from the live table and add them to the stored aggregates for displaying the current values of total counts and such.
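A sketch of that pre-aggregation, again with placeholder names, building on the hypothetical log table above:

-- Summary table refreshed by a scheduled job once per month (or week, day)
CREATE TABLE dbo.EventCountsByMonth
(
    MonthStart datetime    NOT NULL,
    EventType  varchar(50) NOT NULL,
    EventCount int         NOT NULL,
    CONSTRAINT PK_EventCountsByMonth PRIMARY KEY (MonthStart, EventType)
);

-- Append the month that has just closed
INSERT INTO dbo.EventCountsByMonth (MonthStart, EventType, EventCount)
SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, EventTime), 0), EventType, COUNT(*)
FROM dbo.EventLog
WHERE EventTime >= '20090601' AND EventTime < '20090701'
GROUP BY DATEADD(MONTH, DATEDIFF(MONTH, 0, EventTime), 0), EventType;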
Sounds like one table to me, one that would need indexes on exactly the sets of columns you will filter on. Restricting access through views is generally a good idea and helps ensure your indexes actually get used.
Putting each source into its own table will require UNIONs in your queries later, and SQL Server is not very good at optimizing UNION queries.
"Archiving" entries can of course be done manually, by moving entries in a date range to another table (which can live on another disk or in another database), or by using "partitioning", which means you can put parts of a table (e.g. defined by date ranges) on different disks. You have to plan for the partitions when you plan your SQL Server installation.
Be aware that the Express edition is limited to a 4 GB database, so at 1.5 million rows per month this could become a problem.
I have a table like yours with 20M rows and have little problem querying and even joining, as long as the indexes are used.

How to avoid sql server page fragmentation in this scenario?

I want to order SQL Inserts into a table to optimize page use by avoiding fragmentation as much as possible.
I will be running a .NET Windows service which, every 2 hours, will take some data from a database and optimize it for future queries. A varchar(6000) column is involved, though I estimate it will rarely go beyond 4000 bytes.
In fact, this column normally varies between 600 and 2400 bytes; it's 6000 to help avoid truncation errors. Still, I can control that column's size through .NET.
There won't ever be updates or deletes - just selects (and inserts every 2 hours).
There will be around 1000 inserts every 2 hours.
I'm using SQL Server 2005. Page size is said to be 8096 bytes.
I need to insert rows into a table. Given the size of the rows, between 4 and 12 rows could fit in a page.
So from .NET I will read data from the database, store it in memory (use some clustering algorithm maybe?), and then insert around 1000 rows.
I was wondering if there is a way to avoid or minimize page fragmentation in this scenario.
Is the table a btree or a heap? Do you have a clustered index on it? If yes, then what column is the clustered index on, and how is the column value computed at insert?
Why do you care about fragmentation to start with? Space considerations or read-ahead performance? For space, you should skip SQL 2005 and go to SQL 2008 for page compression. For read-ahead, it would be worth investigating why you need large read-aheads to start with.
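If you do end up on SQL Server 2008 (or later), page compression is a one-off rebuild per object; a hedged sketch with a hypothetical table name (note it is an Enterprise edition feature):

-- SQL Server 2008+: enable page compression on the table and its indexes
ALTER TABLE dbo.OptimizedData REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER INDEX ALL ON dbo.OptimizedData REBUILD WITH (DATA_COMPRESSION = PAGE);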
Overall, index fragmentation is more of an overhyped brouhaha that everyone talks about but very few really understand. There are many, many more avenues to pursue before fragmentation becomes the real bottleneck.

Are these tables too big for SQL Server or Oracle

I'm not much of a database guru so I would like some advice.
Background
We have 4 tables that are currently stored in Sybase IQ. We don't currently have any choice over this, we're basically stuck with what someone else decided for us. Sybase IQ is a column-oriented database that is perfect for a data warehouse. Unfortunately, my project needs to do a lot of transactional updating (we're more of an operational database) so I'm looking for more mainstream alternatives.
Question
Given these tables' dimensions, would anyone consider SQL Server or Oracle to be a viable alternative?
Table 1 : 172 columns * 32 million rows
Table 2 : 453 columns * 7 million rows
Table 3 : 112 columns * 13 million rows
Table 4 : 147 columns * 2.5 million rows
Given the size of data what are the things I should be concerned about in terms of database choice, server configuration, memory, platform, etc.?
Yes, both should be able to handle your tables (if your server is suited to it). But I would consider redesigning your database a bit. Even in a data warehouse where you denormalize your data, a table with 453 columns is not normal.
It really depends on what's in the columns. If there are lots of big VARCHAR columns -- and they are frequently filled to near capacity -- then you could be in for some problems. If it's all integer data then you should be fine.
453 * 4 = 1812 # columns are 4 byte integers, row size is ~1.8k
453 * 255 = 115,515 # columns are VARCHAR(255), theoretical row size is ~112k
The rule of thumb is that row size should not exceed the disk block size, which is generally 8k. As you can see, your big table is not a problem in this regard if it consists entirely of 4-byte integers but if it consists of 255-char VARCHAR columns then you could be exceeding the limit substantially. This 8k limit used to be a hard limit in SQL Server but I think these days it's just a soft limit and performance guideline.
Note that VARCHAR columns don't necessarily consume memory commensurate with the size you specify for them. That is the max size, but they only consume as much as they need. If the actual data in the VARCHAR columns is always 3-4 chars long then size will be similar to that of integer columns regardless of whether you created them as VARCHAR(4) or VARCHAR(255).
The general rule is that you want row size to be small so that there are many rows per disk block, this reduces the number of disk reads necessary to scan the table. Once you get above 8k you have two reads per row.
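If you want to see where you actually stand, SQL Server 2005 exposes average record size and page fullness per index; a quick look (database and table names are placeholders):

-- Average record size and page density for every index on the table
SELECT i.name, ps.index_level, ps.avg_record_size_in_bytes, ps.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID('MyDb'), OBJECT_ID('dbo.Table2'), NULL, NULL, 'DETAILED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;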
Oracle has another potential problem which is that ANSI joins have a hard limit on the total number of columns in all tables in the join. You can avoid this by avoiding the Oracle ANSI join syntax. (There are equivalents that don't suffer from this bug.) I don't recall what the limit is or which versions it applies to (I don't think it's been fixed yet).
The numbers of rows you're talking about should be no problem at all, presuming you have adequate hardware.
With suitably sized hardware and an I/O subsystem to meet your demands, both are quite adequate - whilst you have a lot of columns, the row counts are really very low; we regularly use datasets whose row counts are expressed in billions, not millions. (Just do not try it on SQL 2000 :) )
If you know your usage and I/O requirements, most I/O vendors will translate that into hardware specs for you. Memory, processors etc. again are dependent on workloads that only you can model.
Oracle 11g has no problems with such data and structure.
More info at: http://neworacledba.blogspot.com/2008/05/database-limits.html
Regards.
Oracle limitations
SQL Server limitations
You might be close on SQL Server, depending on what data types you have in that 453 column table (note the bytes per row limitation, but also read the footnote). I know you said that this is normalized, but I suggest looking at your workflow and considering ways of reducing the column count.
Also, these tables are big enough that hardware considerations are a major issue with performance. You'll need an experienced DBA to help you spec and set up the server with either RDBMS. Properly configuring your disk subsystem will be vital. You will probably also want to consider table partitioning among other things to help with performance, but this all depends on exactly how the data is being used.
Based on your comments in the other answers I think what I'd recommend is:
1) Isolate which data is actually updated vs. which data is more or less read only (or infrequently)
2) Move the updated data to separate tables joined on an id to the bigger tables (deleting those columns from the big tables)
3) Do your OLTP transactions against the smaller, more relational tables
4) Use inner joins to hook back up to the big tables to retrieve data when necessary (a rough sketch follows below).
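A minimal sketch of that split, with hypothetical column names, assuming the big table has an integer key:

-- Step 2: hot, frequently-updated columns moved to a narrow side table
CREATE TABLE dbo.Table1_Hot
(
    Id         int         NOT NULL PRIMARY KEY,
    StatusCode varchar(10) NOT NULL,
    UpdatedAt  datetime    NOT NULL,
    CONSTRAINT FK_Table1_Hot_Table1 FOREIGN KEY (Id) REFERENCES dbo.Table1 (Id)
);

-- Step 3: OLTP writes hit only the narrow table
UPDATE dbo.Table1_Hot SET StatusCode = 'DONE', UpdatedAt = GETDATE() WHERE Id = 42;

-- Step 4: reads join back to the wide table when the extra columns are needed
SELECT h.StatusCode, t.*            -- t.* stands in for whichever wide columns you need
FROM dbo.Table1_Hot AS h
JOIN dbo.Table1     AS t ON t.Id = h.Id
WHERE h.UpdatedAt >= DATEADD(DAY, -1, GETDATE());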
As others have noted you are trying to make the DB do both OLTP and OLAP at the same time and that is difficult. Server settings need to be tweaked differently for either scenario.
Either SQL Server or Oracle should work. I use census data as well, and my giganto table has around 300+ columns. I use SQL Server 2005 and it warns that if all the columns were filled to capacity the row would exceed the maximum possible size for a record. We use our census data in an OLAP fashion, so having so many columns isn't such a big deal.
Are all of the columns in all of those tables updated by your application?
You could consider having data marts (AKA an operational or online data store) that are updated during the day, with new records migrated into the main warehouse at night. I say this because rows with massive numbers of columns are going to be slower to insert and update, so you may want to consider tailoring your specific online architecture to your application's update requirements.
Asking one DB to act as both an operational and a warehouse system at the same time is still a bit of a tall order. I would consider using SQL Server or Oracle for the operational system and having a separate DW for reporting and analytics, probably keeping the system you have.
Expect some table redesign and normalization to happen on the operational side to fit the one-row-per-page limitations of row-based storage.
If you need fast updates of the DW, you may consider an EP-for-ETL approach, as opposed to standard (scheduled) ETL.
Considering that you are in the early stages of this, take a look at Microsoft's project Madison, which is an auto-scalable DW appliance scaling up to hundreds of TB. They have already shipped some installations.
I would very carefully consider switching from a column-oriented database to a relational one. Column-oriented databases are indeed inadequate for operational work, as updates are very slow, but they are more than adequate for reporting and business intelligence support.
More often than not one has to split the operational work into an OLTP database containing the current activity needed for operations (accounts, inventory etc.) and use an ETL process to populate the data warehouse (history, trends). A column-oriented DW will beat a row-oriented relational one hands down in almost any circumstance, so I wouldn't give up Sybase IQ so easily. Perhaps you can design your system with an operational OLTP side using your relational product of choice (I would choose SQL Server, but I'm biased) and keep the OLAP part you have now.
Sybase have a product called RAP that combines IQ with an in-memory instance of ASE (their relational database) which is designed to help in situations such as this.
Your data isn't so vast that you couldn't consider moving to a row-oriented database, but depending on the structure of the data you could end up using considerably more disk space and slowing down many kinds of queries.
Disclaimer: I do work for Sybase but not currently on the ASE/IQ/RAP side.

Very large tables in SQL Server

We have a very large table (> 77M records and growing) running on SQL Server 2005 64-bit Standard Edition and we are seeing some performance issues. Up to a hundred thousand records are added daily.
Does anyone know if there is a limit to the number of records SQL Server Standard Edition can handle? Should we be considering a move to Enterprise Edition, or are there some tricks we can use?
Additional info:
The table in question is pretty flat (14 columns), there is a clustered index with 6 fields, and two other indexes on single fields.
We added a fourth index using 3 fields that were in a select in one problem query and did not see any difference in the estimated performance (the query is part of a process that has to run in the off hours so we don't have metrics yet). These fields are part of the clustered index.
Agreeing with Marc and Unknown above ... 6 fields in the clustered index is way too many, especially on a table that has only 14 columns. You shouldn't have more than 3 or 4, if that; I would say 1 or maybe 2. You may know that the clustered index is the actual table on the disk, so when a record is inserted the database engine must sort it and place it in its sorted, organized place on the disk. Non-clustered indexes are not; they are supporting lookup 'tables'. My VLDBs are laid out on the disk (CLUSTERED INDEX) according to the first point below.
Reduce your clustered index to 1 or 2 fields. The best choices are the IDENTITY (INT), if you have one, or a date field reflecting when rows are added to the database, or some other field that is a natural sort of how your data is being added. The point is that you are trying to keep that new data at the bottom of the table ... or have it laid out on the disk in the way that best matches (90%+) how you'll read the records back out. This means there is no reorganizing going on, and it takes one and only one hit to get the data into the right place for the best read. Be sure to put the removed fields into non-clustered indexes so you don't lose the lookup efficiency. I have NEVER put more than 4 fields on my VLDBs. If you have fields that are updated frequently and they are included in your clustered index, OUCH - that's going to reorganize the record on the disk and cause COSTLY fragmentation.
Check the fillfactor on your indexes. The larger the fill factor number (100), the more full the data pages and index pages will be. In relation to how many records you have and how many you are inserting, you will change the fillfactor (+ or -) of your non-clustered indexes to allow for free space when a record is inserted. If you change your clustered index to a sequential data field, this won't matter as much for the clustered index. Rule of thumb (IMO): 60-70 fillfactor for high writes, 70-90 for medium writes, and 90-100 for high reads/low writes. Dropping your fillfactor to 70 means that for every 100 records' worth of space on a page, 70 records are written, leaving free space for 30 more new or reorganized records. It eats up more space, but it sure beats having to DEFRAG every night (see the defrag point below).
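For example, rebuilding a non-clustered index with a 70 fill factor (index and table names here are placeholders; ALTER INDEX is available from SQL Server 2005 onwards):

-- Leave ~30% free space per page for incoming rows on a write-heavy index
ALTER INDEX IX_MyTable_LookupCol ON dbo.MyTable
REBUILD WITH (FILLFACTOR = 70);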
Make sure statistics exist on the table. If you want to sweep the database and create statistics using sp_createstats 'indexonly', then SQL Server will create statistics on all the indexes that the engine has flagged as requiring statistics. Don't leave off the 'indexonly' attribute, though, or you'll add statistics for every field, and that would not be good.
Check the table/indexes using DBCC SHOWCONTIG to see which indexes are getting fragmented the most. I won't go into the details here; just know that you need to do it. Then, based on that information, change the fillfactor up or down according to how the indexes are changing and how fast (over time).
Set up a job schedule that will do online (DBCC INDEXDEFRAG) or offline (DBCC DBREINDEX) defragmentation on individual indexes. Warning: don't run DBCC DBREINDEX on a table this large outside a maintenance window, because it will bring the apps down ... especially on the CLUSTERED INDEX. You've been warned. Test and test this part.
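The two options mentioned, in command form (placeholder database, table and index names; INDEXDEFRAG can run while the database is online, DBREINDEX takes the indexes offline):

-- Online defragmentation of a single index
DBCC INDEXDEFRAG ('MyDb', 'dbo.MyTable', 'IX_MyTable_LookupCol');

-- Full offline rebuild of all indexes on the table - maintenance window only
DBCC DBREINDEX ('dbo.MyTable', '', 70);   -- 70 = fill factor to apply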
Use the execution plans to see what SCANS and FAT PIPES exist, adjust the indexes, then defrag and rewrite stored procs to get rid of those hot spots. If you see a RED object in your execution plan, it's because there are no statistics on that field. That's bad. This step is more art than science.
At off-peak times, run UPDATE STATISTICS WITH FULLSCAN to give the query engine as much information about the data distributions as you can. Otherwise, do the standard UPDATE STATISTICS (with the standard 10% sample) on tables during the weeknights, or more often as you see fit based on your observations, to make sure the engine has enough information about the data distributions to retrieve the data efficiently.
Sorry this is so long, but it's extremely important. I've only given you minimal information here, but it will help a ton. There are gut feelings and observations behind the strategies in these points that will require your own time and testing.
No need to go to Enterprise Edition. I did, though, in order to get the features spoken of earlier around partitioning, but ESPECIALLY to get much better multi-threading capabilities for searching and online defragging and maintenance ... In Enterprise Edition it is much, much better and friendlier with VLDBs. Standard Edition doesn't handle DBCC INDEXDEFRAG against online databases as well.
The first thing I'd look at is indexing. If you use the execution plan generator in Management Studio, you want to see index seeks or clustered index seeks. If you see scans, particularly table scans, you should look at indexing the columns you generally search on to see if that improves your performance.
You should certainly not need to move to Enterprise edition for this.
[there is a clustered index with 6 fields, and two other indexes on single fields.]
Without knowing any details about the fields, I would try to find a way to make the clustered index smaller.
With SQL Server, all the clustered-key fields will also be included in every non-clustered index (as the way to do the final lookup from the non-clustered index to the actual data page).
If you have six fields at 8 bytes each, that's 48 bytes per key; multiply that by two more indices times 77 million rows and you're looking at roughly 7 GB of wasted space, which translates into a lot of extra I/O operations (and thus degrades performance).
For the clustered index, it's absolutely CRUCIAL for it to be unique, stable, and as small as possible (preferably a single INT or such).
Marc
Do you really need to have access to all 77 million records in a single table?
For example, if you only need access to the last X months worth of data, then you could consider creating an archiving strategy. This could be used to relocate data to an archive table in order to reduce the volume of data and subsequently, query time on your 'hot' table.
This approach could be implemented in the standard edition.
If you do upgrade to the Enterprise edition you can make use of table partitioning. Again depending on your data structure this can offer significant performance improvements. Partitioning can also be used to implement the strategy previously mentioned but with less administrative overhead.
Here is an excellent White paper on table partitioning in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope what I have detailed is clear and understandable. Please do feel free to contact me directly if you require further assistance.
Cheers,
http://msdn.microsoft.com/en-us/library/ms143432.aspx
You've got some room to grow.
As far as performance issues, that's a whole other question. Caching, sharding, normalizing, indexing, query tuning, app code tuning, and so on.
Standard should be able to handle it. I would look at indexing and the queries you use with the table. You want to structure things in such a way that your inserts don't cause too many index recalcs, but your queries can still take advantage of the index to limit lookups to a small portion of the table.
Beyond that, you might consider partitioning the table. This will allow you to divide the table into several logical groups. You can do it "behind the scenes", so it still appears in SQL Server as one table even though it is stored separately, or you can do it manually (create a new 'archive' or yearly table and manually move rows over). Either way, only do it after you have looked at the other options first, because if you don't get it right you'll still end up having to check every partition. Also: partitioning requires Enterprise Edition, so that's another reason to save it as a last resort.
In and of itself, 77M records is not a lot for SQL Server. How are you loading the 100,000 records? Is that a batch load each day, or through some sort of OLTP application? And is the performance issue you are having with adding the data, or is it the querying that is giving you the most problems?
If you are adding 100K records at a time, and the records being added force the clustered index to reorganize your table, that will kill your performance quickly. More details on the table structure, indexes and type of data inserted will help.
Also, the amount of RAM and the speed of your disks will make a big difference; what are you running on?
maybe these are minor nits, but....
(1) relational databases don't have FIELDS... they have COLUMNS.
(2) IDENTITY columns usually mean the data isn't normalized (or the designer was lazy). Some combination of columns MUST be unique (and those columns make up the primary key)
(3) indexing on datetime columns is usually a bad idea; CLUSTERING on datetime columns is also usually a bad idea, especially an ever-increasing datetime column, as all the inserts are contending for the same physical space on disk. Clustering on datetime columns in a read-only table where that column is part of range restrictions is often a good idea (see how the ideas conflict? who said db design wasn't an art?!)
What type of disks do you have?
You might monitor some disk counters to see if requests are queuing.
You might move this table to another drive by putting it in another filegroup. You can also do the same with the indexes.
Initially I wanted to agree with Marc. The width of your clustered index seems suspect, as it will essentially be used as the key to perform lookups on all your records. The wider the clustered index, the slower the access, generally. And a six field clustered index feels really, really suspect.
Uniqueness is not required for a clustered index. In fact, the best candidates for fields in the clustered index are ones that are not unique and are used in joins. For example, in a Persons table where each Person belongs to one Group and you frequently join Persons to Groups while accessing batches of people by group, Person.group_id would be an ideal candidate for this particular use case.
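In that use case the index would just be something like this (hypothetical table/column names):

-- Non-unique clustered index: people in the same group end up physically together
-- (SQL Server adds a hidden uniquifier to duplicate key values behind the scenes)
CREATE CLUSTERED INDEX CIX_Persons_GroupId ON dbo.Persons (group_id);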
