Strategies to preserve disk space in high input frequency application - database

I have a requirement to support shrinking of a database that has a data entry rate of 1 entry (approx. 300KB) per second. The database file can reach 3GB. The current database has no auto-vacuum feature, and the file growing past a certain limit (say 3GB) is the worst-case scenario.
My current strategy is to delete the oldest data (by clustered primary key) and then run CHECKPOINT and DEFRAG. This does not seem to be reliable, and the VACUUM or DEFRAG can take a long time. I'd rather not name the specific database product, but I am open to suggestions.
I was wondering what other strategies might be available to reliably (little to no downtime, O(1) operation speed) preserve disk space.
EDIT: Relational database required as reporting and SQL data extraction is necessary.
Fixed-size circular buffer - exactly - I need to replicate this in the relational world, using a low-footprint DB with fast "circular" behaviour (i.e. fast inserts)

1) Determine how many rows of data you can store with your requirements.
2) Use a primary key that is an integer related to #1.
3) Create a way to track primary key usage; for example, store the primary key and a timestamp of use in a separate table.
4) When inserting, always grab the oldest used primary key and update that row with the new data.
If implemented properly, this will limit the # of rows to what you specify in #1.
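A minimal sketch of the four steps above, using SQLite as a stand-in (the slot count, the `ring` table, and the `seq` column are all illustrative assumptions, not a specific product's API):

```python
import sqlite3

MAX_ROWS = 5   # step 1: row budget = size limit / average row size

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ring (
        slot INTEGER PRIMARY KEY,  -- step 2: integer key in [0, MAX_ROWS)
        payload BLOB,
        seq INTEGER                -- step 3: tracks when each slot was written
    )""")
# Pre-allocate every slot once, so later writes are pure UPDATEs (no growth).
conn.executemany("INSERT INTO ring VALUES (?, NULL, -1)",
                 [(i,) for i in range(MAX_ROWS)])

seq = 0
def insert(data):
    # Step 4: overwrite the least recently written slot.
    global seq
    (slot,) = conn.execute(
        "SELECT slot FROM ring ORDER BY seq, slot LIMIT 1").fetchone()
    conn.execute("UPDATE ring SET payload = ?, seq = ? WHERE slot = ?",
                 (data, seq, slot))
    conn.commit()
    seq += 1
    return slot

slots = [insert(("entry-%d" % n).encode()) for n in range(12)]
(count,) = conn.execute("SELECT COUNT(*) FROM ring").fetchone()
print(count)      # 5 -- the table never grows past MAX_ROWS
print(slots[:7])  # [0, 1, 2, 3, 4, 0, 1] -- wraps around like a ring buffer
```

Because every write after the initial fill is an UPDATE of an existing row, the file stops growing once the ring is full, which is the O(1), no-downtime behaviour asked for.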

Related

Database Software Design Review for a Key Value Single Server DB

Context on the problem statement.
Scroll to the bottom for questions.
Note: The tables are not relational, joins can be done at application level.
Classes
Record
The most atomic unit of the database (each record has a key, value, and id)
Page
Each page is a file that can store multiple records. A page is a fixed-size chunk (8 KB?) that also stores, at the top, an offset for retrieving each id.
Index
A B-tree data structure that supports O(log n) lookups to find which page a given id lives in.
We can also insert (id, page) pairs into this B-tree.
Table
Each Table is an abstraction over a directory that stores multiple pages.
Table also stores Index.
Database
Database is an abstraction over a directory which includes all tables that are a part of that database.
Database Manager
Gives ability to switch between different databases, create new databases, and drop existing databases.
Communication In Main Process
Initiates the Database Manager as its own process.
When the process quits, it saves Indexes back to disk.
The process also stores the indexes back to disk on an interval.
To interact with this DB process we will use HTTP.
Database Manager stores a reference to the current database being used.
The current-database attribute stored in the Database Manager holds a reference to all Tables in a hashmap.
Each Table stores a reference to the index that is read from the index page from disk and kept in memory.
Each Table exposes public methods to set and get key value pair.
The Get method navigates the B-tree to find the right page; on that page it finds the key-value pair via the offset stored at the top, and returns a Record.
The Set method adds a key-value pair to the database and then updates the index for that table.
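A toy sketch of the Get/Set path described above. A plain dict stands in for the B-tree index, and JSON-encoded records are an assumed on-page format; both are simplifications for illustration, not part of the original design:

```python
import json

PAGE_SIZE = 8192  # assumed page size, matching the 8 KB guess above

class Page:
    """A fixed-size chunk whose header maps record id -> byte offset."""
    def __init__(self):
        self.header = {}          # id -> offset into self.body ("at the top")
        self.body = b""

    def add(self, rec_id, key, value):
        record = json.dumps({"id": rec_id, "key": key, "value": value}).encode()
        if len(self.body) + len(record) > PAGE_SIZE:
            return False                       # page full; caller opens a new one
        self.header[rec_id] = len(self.body)
        self.body += record
        return True

    def get(self, rec_id):
        off = self.header[rec_id]
        end = self.body.index(b"}", off) + 1   # works while values are plain strings
        return json.loads(self.body[off:end])

class Table:
    """The index here is a plain dict mapping id -> page number; the design
    above calls for a B-tree so lookups stay O(log n) on disk-resident data."""
    def __init__(self):
        self.pages = [Page()]
        self.index = {}
        self.next_id = 0

    def set(self, key, value):
        rec_id = self.next_id
        self.next_id += 1
        if not self.pages[-1].add(rec_id, key, value):
            self.pages.append(Page())
            self.pages[-1].add(rec_id, key, value)
        self.index[rec_id] = len(self.pages) - 1   # Set updates the index too
        return rec_id

    def get(self, rec_id):
        return self.pages[self.index[rec_id]].get(rec_id)

t = Table()
rid = t.set("name", "ada")
print(t.get(rid)["value"])  # ada
```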
Outstanding Questions:
Am I making any logical errors in my design above?
How should I go about figuring out what the data page size should be (I'm not sure why relational DBs use 8 KB)?
How should I store the Index B-tree to disk?
Should the Database load all indexes for its tables into memory at the very start?
A couple of notes off the top of my head:
How many records do you anticipate storing? What are the maximum key and value sizes? I ask, because with a file per page scheme, you might find yourself exhausting available file handles.
Are the database/table distinctions necessary? What does this separation gain you? Truly asking the question, not being socratic.
I would define page size in terms of multiples of your maximum key and value sizes so that you can get good read/write alignment and not have too much fragmentation. It might be worth having a naive, but space inefficient, implementation that aligns all writes.
I would recommend starting with the simplest possible design (load all indices, align writes, flat schema) to launch your database, then layer on complexity and optimizations as they become needed, but not a moment before. Hope this helps!
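One hedged way to turn the page-size advice above into numbers: assuming 64-byte keys, 1 KB values, and a 4 KB filesystem block (all constants made up for illustration), scan a few candidate page sizes for the one that wastes the least space after whole fixed-size slots:

```python
FS_BLOCK = 4096            # assumed filesystem block size
MAX_KEY, MAX_VAL = 64, 1024
SLOT = MAX_KEY + MAX_VAL   # one fixed-size slot per record: wasteful but aligned

def pick_page_size(max_blocks=16):
    # Try page sizes that are whole multiples of the filesystem block and
    # keep the one that leaves the fewest unusable bytes after whole slots.
    best = None
    for blocks in range(1, max_blocks + 1):
        size = blocks * FS_BLOCK
        waste = size % SLOT
        if best is None or waste < best[1]:
            best = (size, waste)
    return best

size, waste = pick_page_size()
print(size, waste)  # 16384 64 -> a 4-block page fits 15 slots with 64 bytes spare
```

This is the "naive but space inefficient" aligned-writes approach: every slot starts at a predictable offset, at the cost of padding short values.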

Does DynamoDB scale when partition key bucket gets huge?

Does DynamoDB scale when partition key bucket gets huge? What are the possible solutions if this is really a problem?
TL;DR - Yes, DynamoDB scales very well as long as you use the service as intended!
As a user of Dynamo your responsibility is to choose a partition key (and range key combination if need be) that provides a relatively uniform distribution of the data and to allocate enough read and write capacity for your use case. If you do use a range key, this means you should aim to have approximately the same number of elements for each of the partition key values in your table.
As long as you follow this rule, Dynamo will scale very well. Even when you hit the size limit for a partition, Dynamo will automatically split the data in the original partition into two equal-sized partitions, each of which will receive about half the data (again - as long as you did a good job of choosing the partition key and range key). This is very well explained in the DynamoDB documentation.
Of course, as your table grows and you get more and more partitions, you will have to allocate more and more read and write capacity to ensure enough is provisioned to sustain all partitions of your table. The capacity is equally distributed to all partitions in your table (except for splits - though, again, if the distribution is uniform even splits will receive capacity uniformly).
For reference, see also: how partitions work.
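The "uniform distribution" point can be illustrated with a small simulation. The mod-the-hash placement below is a simplification of what Dynamo actually does internally, but it shows why partition key cardinality matters:

```python
import hashlib
from collections import Counter

def partition_for(key, n_partitions):
    # Hash the partition key and take it modulo the partition count.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_partitions

# A high-cardinality key (e.g. a user id) spreads items evenly...
good = Counter(partition_for("user-%d" % i, 4) for i in range(100_000))
# ...while a low-cardinality key (e.g. a status flag) creates hot partitions.
bad = Counter(partition_for(s, 4) for s in ["ok", "error"] * 50_000)

print(sorted(good.values()))  # four counts, each close to 25,000
print(len(bad))               # at most 2 partitions ever receive any traffic
```

With the low-cardinality key, no amount of provisioned capacity helps: capacity is spread across all partitions while the writes pile onto one or two of them.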

Reduce SQL Server table fragmentation without adding/dropping a clustered index?

I have a large database (90GB data, 70GB indexes) that's been slowly growing for the past year, and the growth/changes have caused a large amount of internal fragmentation, not only of the indexes but of the tables themselves.
It's easy to resolve the (large number of) very fragmented indexes - a REORGANIZE or REBUILD will take care of that, depending on how fragmented they are - but the only advice I can find on cleaning up actual table fragmentation is to add a clustered index to the table. I'd immediately drop it afterwards, as I don't want a clustered index on the table going forward, but is there another method of doing this without the clustered index? A "DBCC" command that will do this?
Thanks for your help.
Problem
Let's get some clarity, because this is a common problem, a serious issue for every company using SQL Server.
This problem, and the need for CREATE CLUSTERED INDEX, is misunderstood.
Agreed that having a permanent Clustered Index is better than not having one. But that is not the point, and it will lead into a long discussion anyway, so let's set that aside and focus on the posted question.
The point is, you have substantial fragmentation on the Heap. You keep calling it a "table", but there is no such thing at the physical data storage or DataStructure level. A table is a logical concept, not a physical one. It is a collection of physical DataStructures. The collection is one of two possibilities:
Heap
plus all Non-clustered Indices
plus Text/Image chains
or a Clustered Index
(eliminates the Heap and one Non-clustered Index)
plus all Non-clustered Indices
plus Text/Image chains.
Heaps get badly fragmented; the more interspersed (random) Inserts/Deletes/Updates there are, the more fragmentation.
There is no way to clean up the Heap, as is. MS does not provide a facility (other vendors do).
Solution
However, we know that Create Clustered Index rewrites and re-orders the Heap, completely. The method (not a trick), therefore, is to Create Clustered Index only for the purpose of de-fragmenting the Heap, and drop it afterward. You need free space in the db of table_size x 1.25.
While you are at it, by all means, use FILLFACTOR, to reduce future fragmentation. The Heap will then take more allocated space, allowing for future Inserts, Deletes and row expansions due to Updates.
Note
Note that there are three Levels of Fragmentation; this deals with Level III only, fragmentation within the Heap, which is caused by the lack of a Clustered Index.
As a separate task, at some other time, you may wish to contemplate the implementation of a permanent Clustered Index, which eliminates fragmentation altogether ... but that is separate to the posted problem.
Response to Comment
SqlRyan:
While this doesn't give me a magic solution to my problem, it makes pretty clear that my problem is a result of a SQL Server limitation and adding a clustered index is the only way to "defragment" the heap.
Not quite. I wouldn't call it a "limitation".
The method I have given to eliminate the Fragmentation in the Heap is to create a Clustered Index, and then drop it. I.e. a temporary one, the only purpose of which is to correct the Fragmentation.
Implementing a Clustered Index on the table (permanently) is a much better solution, because it reduces overall Fragmentation (the DataStructure can still get Fragmented, refer detailed info in links below), which is far less than the Fragmentation that occurs in a Heap.
Every table in a Relational database (except "pipe" or "queue" tables) should have a Clustered Index, in order to take advantage of its various benefits.
The Clustered Index should be on columns that distribute the data (avoiding INSERT conflicts); never index on a monotonically increasing column, such as a Record ID [1], which guarantees an INSERT Hot Spot in the last Page.
[1] Record IDs on every File render your "database" a non-relational Record Filing System, using SQL merely for convenience. Such Files have none of the Integrity, Power, or Speed of Relational databases.
Andrew Hill:
would you be able to comment further on "Note that there are three Levels of Fragmentation; this deals with Level III only" -- what are the other two levels of fragmentation?
In MS SQL and Sybase ASE, there are three Levels of Fragmentation, and within each Level, several different Types. Keep in mind that when dealing with Fragmentation, we must focus on DataStructures, not on tables (a table is a collection of DataStructures, as explained above). The Levels are:
Level I • Extra-DataStructure
Outside the DataStructure concerned, across or within the database.
Level II • DataStructure
Within the DataStructure concerned, above Pages (across all Pages)
This is the Level most frequently addressed by DBAs.
Level III • Page
Within the DataStructure concerned, within the Pages
These links provide full detail re Fragmentation. They are specific to Sybase ASE, however, at the structural level, the information applies to MS SQL.
Fragmentation Definition
Fragmentation Impact
Fragmentation Type
Note that the method I have given is a Level II operation; it corrects the Level II and III Fragmentation.
You state that you add a clustered index to alleviate the table fragmentation, to then drop it immediately.
The clustered index removes fragmentation by sorting on the cluster key, but you say that this key would not be possible for future use. This begs the question: why defragment using this key at all?
It would make sense to create this clustered key and keep it, as you obviously want/need the data sorted that way. You say that data changes would incur data movement penalties that can't be borne; have you thought about creating the index with a lower FILLFACTOR than the default value? Depending upon data change patterns, you could benefit from something as low as 80%. You then have 20% 'unused' space per page, but the benefit of lower page splits when the clustered key values are changed.
Could that help you?
You can maybe compact the heap by running DBCC SHRINKFILE with NOTRUNCATE.
Based on comments, I see you haven't tested with a permanent clustered index.
To put this in perspective, we have database with 10 million new rows per day with clustered indexes on all tables. Deleted "gaps" will be removed via scheduled ALTER INDEX (and also forward pointers/page splits).
Your 12GB table may be 2GB after indexing: it merely has 12GB allocated but is massively fragmented too.
I understand your pain in being constrained by a legacy design.
Do you have the opportunity to restore a backup of the table in question on another server and create a clustered index there? It is very possible that a clustered index, if created on a set of narrow unique columns or an identity column, will reduce the total table (data and index) size.
In one of my legacy apps all the data was accessed via views. I was able to modify the schema of the underlying table, adding an identity column and a clustered index, without affecting the application.
Another drawback of having the heap is the extra IO associated with any forwarded rows.
I found the article below effective when I was asked whether there was any PROOF that we needed a clustered index permanently on the table.
This article is by Microsoft
The problem that no one is talking about is FRAGMENTATION OF THE DATA OR LOG DEVICE FILES ON THE HARD DRIVE(s) ITSELF!! Everyone talks about fragmentation of the indexes and how to avoid/limit that fragmentation.
FYI: When you create a database, you specify the INITIAL size of the .MDF along with how much it will grow by when it needs to grow. You do the same with the .LDF file. THERE IS NO GUARANTEE THAT WHEN THESE TWO FILES GROW THAT THE DISK SPACE ALLOCATED FOR THE EXTRA DISK SPACE NEEDED WILL BE PHYSICALLY CONTIGUOUS WITH THE EXISTING DISK SPACE ALLOCATED!!
Every time one of these two device files needs to expand, there is the possibility of fragmentation of the hard drive disk space. That means the heads on the hard drive need to work harder (and take more time) to move from one section of the hard drive to another section to access the necessary data in the database. It is analogous to buying a small plot of land and building a house that just fits on that land. When you need to expand the house, you have no more land available unless you buy the empty lot next door - except - what if someone else, in the meantime, has already bought that land and built a house on it? Then you CANNOT expand your house. The only possibility is to buy another plot of land in the "neighborhood" and build another house on it. The problem becomes - you and two of your children would live in House A and your wife and third child would live in House B. That would be a pain (as long as you were still married).
The solution to remedy this situation is to "buy a much larger plot of land, pick up the existing house (i.e. database), move it to the larger plot of land and then expand the house there". Well - how do you do that with a database? Do a full backup, drop the database (unless you have plenty of free disk space to keep both the old fragmented database - just in case - as well as the new database), create a brand new database with plenty of initial disk space allocated (no guarantee that the operating system will insure that the space that you request will be contiguous) and then restore the database into the new database space just created. Yes - it is a pain to do but I do not know of any "automatic disk defragmenter" software that will work on SQL database files.

How to optimize a table for fast inserts only?

I have a log table that will receive inserts from several web apps. I won't be doing any searching/sorting/querying of this data. I will be pulling the data out to another database to run reports. The initial table is strictly for RECEIVING the log messages.
Is there a way to ensure that the web applications don't have to wait on these inserts? For example I know that adding a lot of indexes would slow inserts, so I won't. What else is there? Should I not add a primary key? (Each night the table will be pumped to a reports DB which will have a lot of keys/indexes)
If performance is key, you may not want to write this data to a database. I think most everything will process a database write as a round-trip, but it sounds like you don't want to wait for the returned confirmation message. Check if, as S. Lott suggests, it might not be faster to just append a row to a simple text file somewhere.
If the database write is faster (or necessary, for security or other business/operational reasons), I would put no indexes on the table--and that includes a primary key. If it won't be used for reads or updates, and if you don't need relational integrity, then you just don't need a PK on this table.
To recommend the obvious: as part of the nightly reports run, clear out the contents of the table. Also, never reset the database file sizes (ye olde shrink database command); after a week or so of regular use, the database files should be as big as they'll ever need to be and you won't have to worry about the file growth performance hit.
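A sketch of the flat-file approach suggested above, using SQLite as a stand-in for the reports DB (file names and schema are made up): the hot path is a bare file append, and the nightly job pumps the file into the table and clears it.

```python
import os
import sqlite3
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "app.log")

def log(message):
    # Hot path: one appending write, no round-trip, no index maintenance.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(message.replace("\n", " ") + "\n")

def nightly_load(db):
    # Batch job: pump the flat file into the reporting DB, then clear it.
    db.execute("CREATE TABLE IF NOT EXISTS log_report (message TEXT)")
    with open(log_path, encoding="utf-8") as f:
        db.executemany("INSERT INTO log_report VALUES (?)",
                       ((line.rstrip("\n"),) for line in f))
    db.commit()
    open(log_path, "w").close()   # truncate, like clearing the receiving table

for i in range(3):
    log("request %d handled" % i)

db = sqlite3.connect(":memory:")
nightly_load(db)
(rows,) = db.execute("SELECT COUNT(*) FROM log_report").fetchone()
print(rows)                        # 3
print(os.path.getsize(log_path))   # 0
```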
Here are a few ideas; note that for the last ones to matter you would need extremely high volumes:
do not have a primary key, it is enforced via an index
do not have any other index
Create the database large enough that you do not have any database growth
Place the database on its own disk to avoid contention
Avoid software RAID
place the database on a mirrored disk, which avoids the parity calculation done by RAID 5
No keys,
no constraints,
no validation,
no triggers,
No calculated columns
If you can, have the services insert async, so as to not wait for the results (if that is acceptable).
You can even try inserting into a "daily" table, which will hold fewer records,
and then move those rows across before the batch runs at night.
But mostly on the table NO KEYS/Validation (PK and Unique indexes will kill you)
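The async-insert idea above can be sketched with a queue and a single background writer. SQLite and the batch size of 100 are stand-ins; a real service would use its own database and tuning:

```python
import queue
import sqlite3
import threading

q = queue.Queue()
DONE = object()          # sentinel telling the writer to stop
result = {}

def writer():
    # A single background thread owns the connection; request threads never
    # wait on the database, they only enqueue (an in-memory O(1) operation).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE daily_log (message TEXT)")   # no keys, no indexes
    batch = []
    while True:
        item = q.get()
        if item is DONE:
            break
        batch.append((item,))
        if len(batch) >= 100:                 # insert in batches, not per row
            db.executemany("INSERT INTO daily_log VALUES (?)", batch)
            batch.clear()
    if batch:
        db.executemany("INSERT INTO daily_log VALUES (?)", batch)
    db.commit()
    (result["rows"],) = db.execute("SELECT COUNT(*) FROM daily_log").fetchone()

t = threading.Thread(target=writer)
t.start()
for i in range(250):
    q.put("log line %d" % i)   # the "insert" as seen by the web app
q.put(DONE)
t.join()
print(result["rows"])  # 250
```

The web apps only pay the cost of `q.put`; batching amortizes the per-statement overhead on the writer side.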

Very large tables in SQL Server

We have a very large table (> 77M records and growing) running on SQL Server 2005 64-bit Standard edition, and we are seeing some performance issues. There are up to a hundred thousand records added daily.
Does anyone know if there is a limit to the number of records SQL Server Standard edition can handle? Should we be considering moving to Enterprise edition, or are there some tricks we can use?
Additional info:
The table in question is pretty flat (14 columns), there is a clustered index with 6 fields, and two other indexes on single fields.
We added a fourth index using 3 fields that were in a select in one problem query and did not see any difference in the estimated performance (the query is part of a process that has to run in the off hours so we don't have metrics yet). These fields are part of the clustered index.
Agreeing with Marc and Unknown above ... 6 fields in the clustered index is way too many, especially on a table that has only 14 columns. You shouldn't have more than 3 or 4, if that; I would say 1 or maybe 2. You may know that the clustered index is the actual table on the disk, so when a record is inserted, the database engine must sort it and place it in its sorted, organized place on the disk. Non-clustered indexes are not; they are supporting lookup 'tables'. My VLDBs are laid out on the disk (CLUSTERED INDEX) according to the 1st point below.
Reduce your clustered index to 1 or 2 fields. The best field choices are the IDENTITY (INT), if you have one, or a date field on which rows are being added to the database, or some other field that is a natural sort of how your data is being added. The point is that you are trying to keep that data at the bottom of the table ... or have it laid out on the disk in the best (90%+) way that you'll read the records out. This makes it so that there is no reorganizing going on, and that it takes one and only one hit to get the data in the right place for the best read. Be sure to put the removed fields into non-clustered indexes so you don't lose the lookup efficacy. I have NEVER put more than 4 fields on my VLDBs. If you have fields that are updated frequently and they are included in your clustered index, OUCH, that's going to reorganize the record on the disk and cause COSTLY fragmentation.
Check the fillfactor on your indexes. The larger the fillfactor number (100 being the max), the more full the data pages and index pages will be. In relation to how many records you have and how many records you are inserting, you will change the fillfactor # (+ or -) of your non-clustered indexes to allow for fill space when a record is inserted. If you change your clustered index to a sequential data field, then this won't matter as much on the clustered index. Rule of thumb (IMO): 60-70 fillfactor for high writes, 70-90 for medium writes, and 90-100 for high reads/low writes. Dropping your fillfactor to 70 means that for every 100 records on a page, 70 records are written, leaving free space for 30 new or reorganized records. Eats up more space, but it sure beats having to DEFRAG every night (see 4 below)
Make sure statistics exist on the table. If you want to sweep the database to create statistics using "sp_createstats 'indexonly'", then SQL Server will create statistics on all the indexes that the engine has accumulated as requiring statistics. Don't leave off the 'indexonly' attribute though, or you'll add statistics for every field, and that would not be good.
Check the table/indexes using DBCC SHOWCONTIG to see which indexes are getting fragmented the most. I won't go into the details here, just know that you need to do it. Then based on that information, change the fillfactor up or down in relation to the changes the indexes are experiencing change and how fast (over time).
Setup a job schedule that will do online (DBCC INDEXDEFRAG) or offline (DBCC DBREINDEX) on individual indexes to defrag them. Warning: don't do DBCC DBREINDEX on this large of a table without it being during maintenance time cause it will bring the apps down ... especially on the CLUSTERED INDEX. You've been warned. Test and test this part.
Use the execution plans to see what SCANS, and FAT PIPES exist and adjust the indexes, then defrag and rewrite stored procs to get rid of those hot spots. If you see a RED object in your execution plan, it's because there are not statistics on that field. That's bad. This step is more of the "art than the science".
At off-peak times, run UPDATE STATISTICS WITH FULLSCAN to give the query engine as much information about the data distributions as you can. Otherwise do the standard UPDATE STATISTICS (with the standard 10% scan) on tables during the weeknights, or more often as you see fit based on your observations, to make sure the engine has enough information about the data distributions to retrieve the data efficiently.
Sorry this is so long, but it's extremely important. I've only given you minimal information here, but it will help a ton. There are some gut feelings and observations that go into the strategies used by these points, and they will require your time and testing.
No need to go to Enterprise edition. I did, though, in order to get the features spoken of earlier with partitioning, but ESPECIALLY to have much better multi-threading capabilities with searching and online DEFRAGGING and maintenance ... In Enterprise edition, it is much, much better and more friendly with VLDBs. Standard edition doesn't handle doing DBCC INDEXDEFRAG with online databases as well.
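The fillfactor rule of thumb above can be checked numerically. Assuming roughly 8060 usable bytes per data page and a made-up fixed row size:

```python
PAGE_BYTES = 8060   # approximate usable bytes on a SQL Server data page
ROW_BYTES = 80      # assumed fixed row size, purely for illustration

def rows_per_page(fillfactor):
    # fillfactor is the % of each page filled at index (re)build time; the
    # remainder is headroom that absorbs inserts/updates before a page split.
    full = PAGE_BYTES // ROW_BYTES
    return int(full * fillfactor / 100)

for ff in (100, 90, 70):
    used = rows_per_page(ff)
    free = rows_per_page(100) - used
    print(ff, used, free)   # at 70: 70 rows written, room for 30 more per page
```

The trade-off is exactly as described: a 70 fillfactor uses ~43% more pages for the same data, but each page can absorb 30 rows of churn before splitting.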
The first thing I'd look at is indexing. If you use the execution plan generator in Management Studio, you want to see index seeks or clustered index seeks. If you see scans, particularly table scans, you should look at indexing the columns you generally search on to see if that improves your performance.
You should certainly not need to move to Enterprise edition for this.
[there is a clustered index with 6 fields, and two other indexes on single fields.]
Without knowing any details about the fields, I would try to find a way to make the clustered index smaller.
With SQL Server, all the clustered-key fields will also be included in all the non-clustered indices (as a way to do the final lookup from non-clustered index to actual data page).
If you have six fields at 8 bytes each = 48 bytes, multiply that by two more indices times 77 million rows, and you're looking at a lot of wasted space, which translates into a lot of I/O operations (and thus degrades performance).
For the clustered index, it's absolutely CRUCIAL for it to be unique, stable, and as small as possible (preferably a single INT or such).
Marc
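Marc's back-of-the-envelope figure can be worked out explicitly (8 bytes per field is his assumption, carried through here):

```python
fields = 6
bytes_per_field = 8           # assumed, as in the answer above
nonclustered_indexes = 2
rows = 77_000_000

key_bytes = fields * bytes_per_field                 # 48 bytes of clustered key
overhead = key_bytes * nonclustered_indexes * rows   # copied into each NC index
print(key_bytes)              # 48
print(overhead / 1024**3)     # about 6.9 GB of duplicated key data
```

Shrinking the clustered key to a single 4-byte INT would cut that duplicated-key overhead by a factor of 12.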
Do you really need to have access to all 77 million records in a single table?
For example, if you only need access to the last X months worth of data, then you could consider creating an archiving strategy. This could be used to relocate data to an archive table in order to reduce the volume of data and subsequently, query time on your 'hot' table.
This approach could be implemented in the standard edition.
If you do upgrade to the Enterprise edition you can make use of table partitioning. Again depending on your data structure this can offer significant performance improvements. Partitioning can also be used to implement the strategy previously mentioned but with less administrative overhead.
Here is an excellent White paper on table partitioning in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope what I have detailed is clear and understandable. Please do feel free to contact me directly if you require further assistance.
Cheers,
http://msdn.microsoft.com/en-us/library/ms143432.aspx
You've got some room to grow.
As far as performance issues, that's a whole other question. Caching, sharding, normalizing, indexing, query tuning, app code tuning, and so on.
Standard should be able to handle it. I would look at indexing and the queries you use with the table. You want to structure things in such a way that your inserts don't cause too many index recalcs, but your queries can still take advantage of the index to limit lookups to a small portion of the table.
Beyond that, you might consider partitioning the table. This will allow you to divide the table into several logical groups. You can do it "behind the scenes", so it still appears in SQL Server as one table even though it is stored separately, or you can do it manually (create a new 'archive' or yearly table and manually move over rows). Either way, only do it after you have looked at the other options first, because if you don't get it right you'll still end up having to check every partition. Also: partitioning requires Enterprise Edition, so that's another reason to save this for a last resort.
In and of itself, 77M records is not a lot for SQL Server. How are you loading the 100,000 records? Is that a batch load each day, or through some sort of OLTP application? And is the performance issue you are having with adding the data, or is it the querying that is giving you the most problems?
If you are adding 100K records at a time, and the records being added are forcing the cluster-index to re-org your table, that will kill your performance quickly. More details on the table structure, indexes and type of data inserted will help.
Also, the amount of ram and the speed of your disks will make a big difference, what are you running on?
maybe these are minor nits, but....
(1) relational databases don't have FIELDS... they have COLUMNS.
(2) IDENTITY columns usually mean the data isn't normalized (or the designer was lazy). Some combination of columns MUST be unique (and those columns make up the primary key)
(3) indexing on datetime columns is usually a bad idea; CLUSTERING on datetime columns is also usually a bad idea, especially an ever-increasing datetime column, as all the inserts are contending for the same physical space on disk. Clustering on datetime columns in a read-only table where that column is part of range restrictions is often a good idea (see how the ideas conflict? who said db design wasn't an art?!)
What type of disks do you have?
You might monitor some disk counters to see if requests are queuing.
You might move this table to another drive by putting it in another filegroup. You can also to the same with the indexes.
Initially I wanted to agree with Marc. The width of your clustered index seems suspect, as it will essentially be used as the key to perform lookups on all your records. The wider the clustered index, the slower the access, generally. And a six field clustered index feels really, really suspect.
Uniqueness is not required for a clustered index. In fact, the best candidates for fields that should be in the clustered index are ones that are not unique and used in joins. For example, in a Persons table where each Person belongs to one Group and you frequently join Persons to Groups, while accessing batches of people by group, Person.group_id would be an ideal candidate, for this particular use case.
