I'm designing a database (SQLite, SQL Server and DB2), where a table holds a 32kB blob that must be unique. The table will typically hold about 20,000 rows.
I can think of two solutions:
1 - Make the blob a unique index.
2 - Calculate a hash of the blob, use that as a non-unique index, and write code that enforces the blob's uniqueness.
Solution 1 is safer, but is the storage space overhead and performance penalty bad enough to make solution 2 a better choice?
I would go with #2, partly as a space-saving measure, but more because some DBMSs don't allow indexes on LOBs (Oracle comes to mind, but that may be an old restriction).
I would probably create two columns for hash values, MD5 and SHA1 (both commonly supported in client languages). Then add a unique composite index that covers those two columns. The likelihood of a collision on both hashes is infinitesimally small, particularly given your anticipated table sizes. However, you should still have a recovery strategy (which could be as simple as setting one of the values to 0).
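As a minimal sketch of that layout (generic SQL; the table and column names here are hypothetical, and the exact blob/binary types vary between SQLite, SQL Server and DB2), with the hash values computed in client code before the insert:

    CREATE TABLE blob_store (
        id           INTEGER     NOT NULL PRIMARY KEY,
        payload      BLOB        NOT NULL,  -- the ~32 kB value (varbinary(max) on SQL Server)
        payload_md5  BINARY(16)  NOT NULL,  -- MD5 of payload, computed client-side
        payload_sha1 BINARY(20)  NOT NULL   -- SHA-1 of payload, computed client-side
    );

    -- Composite unique index that enforces (near-certain) uniqueness of the blob
    CREATE UNIQUE INDEX ux_blob_store_hashes
        ON blob_store (payload_md5, payload_sha1);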
I have a table where I need to enforce a unique constraint across multiple columns. Instead of creating a multi-column unique index, I could also introduce an extra column based on a hash of all the required fields. Which one will be more effective in terms of database performance?
MySQL suggests the hashed-column method, but I couldn't find any information regarding SQL Server.
The link you give states:
If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns.
So the performance improvement really relies on the indexed hash being quite a bit smaller than the combined multiple columns. This could easily not be the case, given that an MD5 is 16 bytes. I'd consider how much wider the average index key would be for the multi-column index, and to be honest I'd probably not bother with the hash anyway.
You could, if you feel inclined, benchmark your system with both approaches. And if the potential benefits don't tempt you into trying that, again I'd not bother.
I've used the technique more often for change detection, where checking for a change in 100 separate columns of a table row is much more compute intensive than comparing two hashes.
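As a rough, hedged illustration of both ideas (SQL Server syntax; the table, columns and staging schema are invented, and HASHBYTES/CONCAT are assumed to be available, i.e. SQL Server 2012 or later):

    -- Option A: the plain "wide" multi-column unique index
    CREATE UNIQUE INDEX UX_Orders_Natural
        ON dbo.Orders (CustomerRef, ProductCode, RegionCode);

    -- Option B: a persisted, indexed hash of the same columns
    ALTER TABLE dbo.Orders
        ADD RowHash AS CAST(HASHBYTES('MD5',
                CONCAT(CustomerRef, '|', ProductCode, '|', RegionCode)) AS BINARY(16))
            PERSISTED;

    CREATE INDEX IX_Orders_RowHash ON dbo.Orders (RowHash);

    -- Change detection: compare one stored hash instead of every column
    SELECT s.OrderID
    FROM staging.Orders AS s
    JOIN dbo.Orders AS o ON o.OrderID = s.OrderID
    WHERE o.RowHash <> CAST(HASHBYTES('MD5',
            CONCAT(s.CustomerRef, '|', s.ProductCode, '|', s.RegionCode)) AS BINARY(16));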
My database has one very large table with over 2 billion rows and 3 columns.
Id(uniqueidentity), Type(int, between 0-10. 0 = most used. 10 = least used), Data(Binary data between 1-10MB)
What are some ways I can optimize this database? (primarily select queries)
*Note: I might add a few more columns to this table later (eg: location, date...)
Assuming that the id column is the clustered index key, and assuming that by uniqueidentity you mean uniqueidentifier:
do you need the uniqueidentifier type? Why?
What other alternatives have you considered?
Do you populate the data using sequential GUIDs or not?
GUIDs are a notoriously poor choice for clustered keys. See GUIDs as PRIMARY KEYs and/or the clustering key for a more detailed discussion:
But, a GUID that is not sequential - like one that has its values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63-1 rows.
Also read Disk space is cheap...That's not the point! as a follow up.
Other than this, you need to do your homework and post the required details for such a question: exact table and index definitions, and the prevalent data access pattern (by key, by range, filters, sort order, joins, etc.).
Have you done any work to identify problems so far? If not, start with Waits and Queues, a proven methodology to identify performance bottlenecks. Once you measure and find places that need improvement, we can advise how to improve.
Add an index (or indexes). Decide which column(s) would make the most appropriate clustered index.
Decide if storing 10MB of binary data in each (otherwise small) row is a good use of a database.
[Updated in response to Remus's comment]
Is there any difference in performance (in terms of inserting/updating & querying) a table if the primary key is a single column (e.g., a GUID generated for every row) or multiple columns (e.g., a foreign key GUID + an offset number)?
I would assume querying speeds would, if anything, be quicker with multi-column primary keys, but I would imagine inserting would be slower due to a slightly more complicated uniqueness check. I also imagine the data types of a multi-column primary key could matter (e.g., if one of the columns was a DateTime type it would add complexity). These are just my thoughts to invite answers & discussion (hopefully!) and are not fact based.
I realise there are some other questions covering this topic, but I'm wondering about performance impacts rather than management/business concerns.
You will be affected more by each component of the key being (a) variable length and (b) wide [wide instead of narrow columns] than by the number of components in the key. Unless MS have broken it again in the latest release (they broke Heaps in 2005). Datatype does not slow it down; the width, and particularly variable length (of any datatype), does. Note that a fixed-length column is made variable if it is set to Nullable. Variable-length columns in indices are bad news, because a bit of "unpacking" has to be performed on every access to get at the data.
Obviously, keep indexed columns as narrow as possible, using only fixed-length, non-Nullable columns.
In terms of number of columns in a compound key, sure one column is faster than seven, but not that much: three fat wide variable columns are much slower than seven thin fixed columns.
GUID is of course a very fat key; GUID plus anything else is very, very fat; GUID Nullable is Guinness material. Unfortunately it is the knee-jerk reaction to solving the IDENTITY problem, which in turn is a consequence of not having chosen good natural relational keys. So you are best advised to fix the real problem at the source, and choose good natural keys; avoid IDENTITY; avoid GUID.
Experience and performance tuning, not conjecture.
It depends on your access patterns, read/write ratio and whether (possibly most importantly) the clustered index is defined on the Primary Key.
Rule of thumb is make your primary key as small as possible (32 bit int) and define the clustered index on a monotonically increasing key (think IDENTITY) where possible, unless you have range searches that form a large proportion of the queries against that table.
If your application is write intensive, and you define the clustered index on the GUID column you should note:
All non-clustered indexes will contain the clustered index key and will therefore be larger. This may have a negative effect on performance if there are many NC indexes.
Unless you are using an 'ordered' GUID (such as a COMB or one generated with NEWSEQUENTIALID()), your inserts will fragment the index over time. This means you need a regular index rebuild and possibly an increase in the amount of free space left in pages (fill factor), as sketched below.
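For example, a hedged sketch of the maintenance this implies (SQL Server syntax; the table, index name and fill factor are placeholders):

    -- Check fragmentation of the clustered index
    SELECT index_id, avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'),
                                        NULL, NULL, 'LIMITED');

    -- Rebuild it, leaving 10% free space per page to absorb random GUID inserts
    ALTER INDEX PK_MyTable ON dbo.MyTable
        REBUILD WITH (FILLFACTOR = 90);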
Because there are many factors at work (hardware, access patterns, data size), I suggest you run some tests and benchmark your particular circumstances.
It depends on the indexing and storage in each case. All other things being equal, the choice of primary key is irrelevant as far as performance is concerned. The choice of indexes and other storage options would be the deciding factor.
If your situation is going to be geared towards a higher number of inserts, then the smaller the footprint, the better.
There are two things you need to separate, the concept of the primary key at the database level, and the concept of the key your application uses.
Why do you need a GUID? Are you going to be inserting into multiple database servers, and then combining the information into one centralized database?
If that is the case then my recommendation is an identity followed by a GUID: a clustered index on the identity, and a unique nonclustered index on the GUID. If you use the GUID as a clustered index, then your data inserts will be all over the place, meaning your data will not be inserted sequentially, and this causes performance problems as your system will be inserting and moving pages around randomly.
Having your data inserted nicely in an ordered fashion, thanks to the identity, is the way to go. You can leave the sorting to the index structure (the nonclustered unique index containing the GUID), which is a much more efficient structure to sort than the table data.
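A minimal sketch of that arrangement (T-SQL; the table and column names are made up):

    CREATE TABLE dbo.Customer (
        CustomerId   INT IDENTITY(1,1) NOT NULL,   -- sequential surrogate key
        CustomerGuid UNIQUEIDENTIFIER  NOT NULL
            CONSTRAINT DF_Customer_Guid DEFAULT NEWID(),
        Name         NVARCHAR(100)     NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId)
    );

    -- The GUID stays unique and searchable, but doesn't drive the physical order
    CREATE UNIQUE NONCLUSTERED INDEX UX_Customer_Guid
        ON dbo.Customer (CustomerGuid);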
I'm fairly well versed in SQL Server performance, but I constantly have to argue down the idea that GUIDs should be used as the default type for clustered primary keys.
Assuming that the table has a fairly low number of inserts per day (5,000 +/- rows/day), what kind of performance issues could we run into? How will page splits affect our seek performance? How often should I reindex (or should I defrag)? What should I set the fill factor to (100, 90, 80, etc.)?
What if I were inserting 1,000,000 rows per day?
I apologize beforehand for all of the questions, but I'm looking to get some backup for not using GUIDs as our default for PKs. I am however completely open to having my mind changed by the overwhelming knowledge of the StackOverflow user base.
If you are doing any kind of volume, GUIDs are extremely bad as a PK unless you use sequential GUIDs, for the exact reasons you describe. Page fragmentation is severe:
Type              Average Fragmentation in Percent   Fragment Count   Average Fragment Size   Page Count   Average Space Used
id                                            4.35                7                   16.43          115                99.89
newidguid                                    98.77              162                       1          162                70.90
newsequentualid                               4.35                7                   16.43          115                99.89
And as this comparison between GUIDs and integers shows:
Test1 caused a tremendous amount of page splits, and had a scan density around 12% when I ran a DBCC SHOWCONTIG after the inserts had completed. The Test2 table had a scan density around 98%
If your volume is very low, however, it just doesn't matter that much.
If you do really need a globally unique ID but have high volume (and can't use sequential IDs), just put the GUIDs in an indexed column.
Drawbacks of using GUID as primary key:
No meaningful ordering, which means an index on a GUID doesn't give the performance boost it does with an integer.
A GUID is 16 bytes, versus 2, 4 or 8 bytes for an integer.
Very difficult for humans to remember, so no good as a reference id.
Advantages:
Allow non-guessable primary keys that can therefore be less dangerous when displayed in a web page query string or in the application.
Useful in Databases that don't provide an auto increment or identity data type.
Useful when you need to join data between two disparate data sources across platforms or environments.
I thought the decision as to whether to use GUIDs was pretty simple, but maybe I'm unaware of other issues.
With such a low number of inserts per day, I doubt that page splitting should be a significant factor. The real question is how 5,000 compares with the existing row count, as this would be the main information needed to decide on an appropriate initial fill factor to defer splits.
This said, I'm personally not a big fan of GUIDs. I understand that they can serve well in some contexts but in many cases they are just "in the way" [of efficiency, of ease of use, of ...]
I find the following questions useful to narrow down on deciding whether GUID should be used or not.
Will the PK be shared/published? (i.e. will it be used beyond its internal use within SQL? Will applications need these keys in a somewhat persistent fashion? Will users somehow see these keys?)
Could the PK be used to help merge disparate data sources?
Does the table have a primary key - possibly composite - made from column(s) in the data? What is the size of this possible key?
How do the primary keys sort? If composite, are the first few columns selective?
Using a GUID (unless it is a sequential GUID) as a clustered index is going to kill insert performance. Since the physical table layout is aligned according to the clustered index, using a GUID with a random sequencing order will cause serious table fragmentation. If you want to use a GUID as a PK/clustered index, it must be a sequential GUID generated with the NEWSEQUENTIALID() function in SQL Server. This will guarantee that the generated GUIDs are ordered sequentially and prevent fragmentation.
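As a sketch (T-SQL; the table is hypothetical), note that NEWSEQUENTIALID() can only be used as a column default, so the ordered GUIDs are generated at insert time:

    CREATE TABLE dbo.Document (
        DocumentId UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT DF_Document_Id DEFAULT NEWSEQUENTIALID(),
        Payload    VARBINARY(MAX)   NULL,
        CONSTRAINT PK_Document PRIMARY KEY CLUSTERED (DocumentId)
    );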
Suppose you have a very large database, and to simplify let's say it consists of one major table you will be doing your lookups on, with one (and only one) primary key field - pk.
Given the fact that all lookups are going to be basically SELECT * FROM table_name WHERE pk=someKeyValue, what is the best way to optimize this database for the fastest lookups?
Edit: just a few more details - INSERTs and UPDATEs are going to be very non-frequent so I don't mind sacrificing performance there to achieve better lookup performance.
Also, seems like clustering is the way to go. Do you have any examples of the kind of increase in performance I can achieve with this method? And how exactly is this done (on any kind of DB)?
If the primary key is clustered, then you won't get any quicker.
If it isn't clustered, and the number of columns in your table is relatively small, then you could in theory create a covering index to speed up the query. But then this negates any insert/update performance enhancements that having the non-clustered primary key would have given you.
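For instance, a hedged sketch of such a covering index (SQL Server syntax; the table and column names are placeholders), which only pays off when the PK is not the clustered index and the column list is small:

    CREATE NONCLUSTERED INDEX IX_MyTable_pk_covering
        ON dbo.MyTable (pk)
        INCLUDE (col1, col2, col3);   -- every non-key column the SELECT needs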
If your primary key is an always-increasing field (e.g. a SQL Server identity, or generated from a sequence in Oracle) then the clustered primary key has no drawbacks anyway.
One thing you could do is make the primary key clustered; this results in the actual data being physically ordered on disk, which makes queries faster.
It will also mean slower inserts, but if you select much more frequently than you insert, this should not be a problem.
If you're using MySQL, you can do some additional things (beyond tuning your cache values). The table engine can be a factor; for instance, MyISAM is widely held to be faster at SELECTs than InnoDB. If this table is primarily a lookup table, and you were using MySQL, that might be a good thing to do. (InnoDB is pretty good on average; it's better on writes than MyISAM, and also, InnoDB never needs to be repaired.)
I have to add two more options to all that was proposed above (I like dwc’s answer). You should consider partitioning if your table is really big.
First, horizontal partitioning (especially if I/O is the bottleneck in your DB). You create several filegroups and locate them on different hard drives. Then create a partition function and a partition scheme to divide your table and put parts of your table on separate HDs (like rows 1-499999 on the F: drive, 500000-999999 on the G: drive, and so on).
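Roughly, and assuming the filegroups already exist on the separate drives (SQL Server syntax; all names and boundary values here are illustrative):

    CREATE PARTITION FUNCTION pf_RowRanges (INT)
        AS RANGE LEFT FOR VALUES (499999, 999999);   -- 2 boundaries = 3 partitions

    CREATE PARTITION SCHEME ps_RowRanges
        AS PARTITION pf_RowRanges TO (FG_DriveF, FG_DriveG, FG_DriveH);

    CREATE TABLE dbo.BigTable (
        Id   INT            NOT NULL,
        Data VARBINARY(MAX) NULL,
        CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (Id)
    ) ON ps_RowRanges (Id);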
Second, vertical partitioning. This would work if you select column sets (not *) in most of your queries. In this case, divide the columns in the table into two groups: first, fields that you need in all queries; second, fields that you rarely need. Create two tables with the same primary key. Use JOINs on the primary key when you need columns from both tables.
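A sketch of that vertical split (SQL Server-flavoured SQL; names and types are invented):

    CREATE TABLE account_core (            -- columns every query needs
        account_id INT          NOT NULL PRIMARY KEY,
        name       VARCHAR(100) NOT NULL,
        status     INT          NOT NULL
    );

    CREATE TABLE account_extra (           -- rarely-needed columns
        account_id INT            NOT NULL PRIMARY KEY
            REFERENCES account_core (account_id),
        notes      VARCHAR(4000)  NULL,
        avatar     VARBINARY(MAX) NULL
    );

    -- Join only when the rarely-used columns are actually needed
    SELECT c.account_id, c.name, e.notes
    FROM account_core AS c
    JOIN account_extra AS e ON e.account_id = c.account_id;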
(This answer pertains to SQL Server 2005/2008.)
If all your queries are going to be based off the PK, you wouldn't get any added benefit by setting an index on the PK, since the table is already indexed by it.
Edit: The only other possible thing I would suggest is looking at normalizing your table (if that is even an option or necessity). By splitting off items into other tables, you can refine what is being pulled back in each query and only pull the less-used items when needed using joins.
Based off the limited description of "a very large database with a single table" it is hard to locate any easy and obvious ways to optimize without looking at what kind of data you are actually storing in your fields.
If your PK order matches insertion order, i.e. time or id/autoincrement, then make it clustered. This will reduce disk and cache thrashing on inserts, leaving more resources to devote to lookups.
Consider tweaking page sizes on the table to be an exact multiple of your record size. This requires intimate knowledge of the particular database software for details of how, and record/index overhead, etc.
If practical, use fixed-size for all columns rather than variable size.
Consider putting the index and/or transaction log files on a separate volume.
Install as much RAM as the software and hardware can use.
If you were using Oracle then I'd advise benchmarking three approaches:
Heap table with primary key index
Index-organised table
Single table hash cluster
1 represents a very vanilla approach -- really it's the lowest common denominator, but could mean 5+ logical reads to get each row, with one of those being a probable physical read of the table if it is not completely cached.
2 will save you one of those logical reads by avoiding the probe to a separate table segment, but might not save you the physical read because the IOT segment will be larger and harder to cache than the index alone.
3 will potentially get you the row with a single logical read, but unless you have the entire table cached that's probably going to translate into a physical read.
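For reference, a rough sketch of options 2 and 3 in Oracle syntax (names and the SIZE/HASHKEYS values are illustrative and would need sizing for the real data):

    -- 2: index-organised table - the whole row lives in the primary key index
    CREATE TABLE lookup_iot (
        pk   NUMBER        PRIMARY KEY,
        data VARCHAR2(200)
    ) ORGANIZATION INDEX;

    -- 3: single-table hash cluster - the key hashes straight to a block
    CREATE CLUSTER lookup_hc (pk NUMBER)
        SIZE 512 SINGLE TABLE HASHKEYS 1000000;

    CREATE TABLE lookup_hashed (
        pk   NUMBER        PRIMARY KEY,
        data VARCHAR2(200)
    ) CLUSTER lookup_hc (pk);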
Benchmarking is highly recommended.