I have a user_info table in five different locations, and I need to integrate all of their rows into a central user_info table. To do this, each row in the source tables needs a unique ID, because every user must still have a unique ID once the rows arrive in the central table.
Now my questions are:
If I use a uniqueidentifier column populated by NEWID() in each source table, will the values be unique globally and for all time, or is there any chance of a duplicate?
How does SQL Server create the NEWID()? I need to know the key generation structure.
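Yes, for practical purposes: the chance of a duplicate between machines is so small that it can be ignored.
On current versions of SQL Server, NEWID() returns a random (version 4) GUID. The often-repeated description of a clock value combined with the MAC address of the primary NIC applies to the older GUID algorithm, which is closer to how NEWSEQUENTIALID() works today.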
However, inserting random values like this as the clustered key on a table is terrible for performance. You should consider either NEWSEQUENTIALID() or a COMB-type function for generating GUIDs, which keeps the collision-avoidance benefits of NEWID() while maintaining acceptable INSERT performance.
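To make the COMB idea concrete, here is a minimal sketch in Python (not T-SQL, and not any official implementation). It assumes the common COMB layout in which the last 6 bytes of the GUID carry a timestamp, since SQL Server gives that trailing byte group the most weight when ordering uniqueidentifier values. The function name `comb_guid` and its `now_ms` parameter are made up for illustration.

```python
import time
import uuid


def comb_guid(now_ms=None):
    """Return a mostly random UUID whose last 6 bytes hold a millisecond timestamp."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    random_part = uuid.uuid4().bytes[:10]   # keep 10 random bytes
    time_part = now_ms.to_bytes(6, "big")   # 48-bit timestamp, most significant byte first
    return uuid.UUID(bytes=random_part + time_part)


earlier = comb_guid(now_ms=1_700_000_000_000)
later = comb_guid(now_ms=1_700_000_000_001)

# The trailing 12 hex digits (exposed by Python as the "node" field)
# grow with time, while the leading part stays random.
print(earlier, later)
print(earlier.node < later.node)
```

New values therefore land near the end of a clustered index instead of at random positions, which is the property the answer is after.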
In my csv files that I want to bulk insert into SQL Server, there's text (serial number) that's not in the .csv column format that I would LOVE to use as a primary key.
EX.
Data from Engine SQL03423,
version 21.04,
time, speed, temp
june 3 1:00, 90, 200
june 3 1:01, 69, 392
The SQL03423 I want to use as a primary key in my database.
However I get reports from this particular engine daily and if I get to use it as a primary key I'm sure I'll run into the issue of using the same primary key the next time I insert new data which will give me an error.
How do I get around this?
I need the serial number regardless even if it's not the primary key.
Also if I can't use it as a primary key, how do I create a "dummy" primary key into the target table that will autoincrement even if that particular column is obviously not in the csv files I'm importing? Is this even possible?
I am aware of stored procedures, views, etc in SQL. I have basic knowledge if that helps.
I'd suggest using an "artificial" primary key most of the time, which for your question means that you should add an extra column instead of using the machine's serial number. Preferably, this should be a UUID (GUID) value instead of a simple INT. Why? Please have a look at this great post from Jeff Atwood explaining it: https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Just add a line like this to your table declaration:
THE_IDCOL_NAME UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY
Find more about NewID() here: https://learn.microsoft.com/en-us/sql/t-sql/functions/newid-transact-sql
One of the great advantages of using UUIDs is that you do not have to worry about overlapping number ranges when generating IDs. Example scenario from my experience: a customer rents machines worldwide and the software logs usage information. Each site uses an independent database that is consolidated into the HQ database. If any of the sites had used overlapping number ranges as primary keys, they would have been in trouble. Using GUIDs solved this.
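As a rough illustration of the approach, the sketch below (Python with an in-memory SQLite database standing in for SQL Server) pulls the serial number out of the first line of the file, stores it as an ordinary column, and gives every row a generated GUID key that is not in the CSV. The table name `engine_log`, the function `load_report`, and the assumption that the serial is the last word of the first line are all illustrative; in SQL Server the key would come from the DEFAULT NEWID() shown above rather than from client code.

```python
import csv
import io
import sqlite3
import uuid

# File contents copied from the question.
SAMPLE = """Data from Engine SQL03423,
version 21.04,
time, speed, temp
june 3 1:00, 90, 200
june 3 1:01, 69, 392
"""


def load_report(conn, text):
    lines = text.splitlines()
    # Assumption: the serial number is the last word of the first line.
    serial = lines[0].rstrip(",").split()[-1]
    # Skip the two preamble lines and the column header line.
    reader = csv.reader(io.StringIO("\n".join(lines[3:])), skipinitialspace=True)
    rows = [
        (str(uuid.uuid4()), serial, t, int(speed), int(temp))
        for t, speed, temp in reader
    ]
    conn.executemany("INSERT INTO engine_log VALUES (?, ?, ?, ?, ?)", rows)
    return serial


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE engine_log ("
    " id TEXT PRIMARY KEY,"  # generated key, not present in the CSV
    " serial TEXT NOT NULL, time TEXT, speed INTEGER, temp INTEGER)"
)
load_report(conn, SAMPLE)
load_report(conn, SAMPLE)  # the next daily report from the same engine
for row in conn.execute("SELECT serial, time, speed, temp FROM engine_log"):
    print(row)
```

Because the serial number is just data, loading a second report from the same engine does not violate the primary key.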
I'm developing a database-intensive application which maintains about 5 tables. These tables contain many thousands of records each. All the tables use GUID clustered primary keys. To make it efficient, I've dropped foreign-keys between the tables.
I am running a script 65000 lines long which creates a whole bunch of tables (including my tables) and stored procedures (about half the time spent there) then proceeds to insert into my tables about 40000 records and then updates about 20000 of those records.
It takes 1:15 on my AMD 3.5 GHz 8-core machine.
Amazingly, if I change those 5 tables such that
- Add a BIGINT identity surrogate primary key (the queries still join using GUID)
- Demote the prior clustered GUID primary key to a unique column
then it runs in 3:00 minutes!
Changing it from BIGINT to INT gets to about 1:30!
How is it possible that a clustered GUID PK runs significantly faster than an autoincremented INT and much faster than an autoincremented BIGINT clustered PK?
NOTE: the GUID values themselves are generated in code, not by DB.
Check out this simplified benchmark script demonstrating what I mean.
http://pastebin.com/ux5wUJgC
Using your test cases, this is expected. The first test only grows a table with one field. The other two build two columns and two indexes.
Here is a more appropriate test. All three tests have a GUID field and an INT (or BIGINT) field. All fields are indexed. The test table with a PK on an INT with a nonclustered index on the UID is faster by 2 seconds on my server.
Here is my test code: http://pastebin.com/MFTA3Da1
After much testing, it turns out that using a GUID PK is faster than an INT surrogate key plus a GUID natural key.
The usual advice to avoid GUID primary keys because of clustering and fragmentation is of little use here. If you are talking about GUID identifiers in the first place, the GUID is likely intrinsic to the data model and must be stored anyway, so a single GUID primary key is the simplest and, in my tests, the fastest option (by far).
In a nutshell - if you need to identify records with guids then their key should be the guid!
Using SQL Server 2005 and 2008.
I've got a potentially very large table (potentially hundreds of millions of rows) consisting of the following columns:
CREATE TABLE (
date SMALLDATETIME,
id BIGINT,
value FLOAT
)
which is being partitioned on column date in daily partitions. The question then is: should the primary key be on (date, id) or (id, date)?
I can imagine that SQL Server is smart enough to know that it's already partitioning on date and therefore, if I'm always querying for whole chunks of days, then I can have it second in the primary key. Or I can imagine that SQL Server will need that column to be first in the primary key to get the benefit of partitioning.
Can anyone lend some insight into which way the table should be keyed?
As is the standard practice, the Primary Key should be the candidate key that uniquely identifies a given row.
What you wish to do is known as Aligned Partitioning, which ensures that the primary key is also split by your partitioning key and stored with the appropriate table data. This is the default behaviour in SQL Server.
For full details, consult the reference "Partitioned Tables and Indexes in SQL Server 2005".
There is no specific need for the partition key to be the first field of any index on the partitioned table. As long as it appears within the index, the index can be aligned to the partition scheme.
With that in mind, you should apply the normal rules for index field order supporting the most queries / selectivity of the values.
I have a brown-field SQL Server 2005 database that uses standard, unsorted GUIDs for the majority of its primary key values and also in clustered indexes (which is bad for performance).
How should I go about changing these to sequential GUIDs? One of the challenges would be to replace all of the foreign key values as I change each primary key.
Do you know of any tools or scripts to perform this type of conversion?
Remember that you can only use the NEWSEQUENTIALID() function as a column default.
So create a new table with two columns: one for the existing key and one whose default is NEWSEQUENTIALID(). Insert the keys from your original table into it (leave the other column out; it will fill itself).
Join back to the original table and update the PK with the new sequential ID. If you have cascading updates, the FKs should update by themselves.
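The steps above can be sketched roughly as follows, using Python with an in-memory SQLite database as a stand-in for SQL Server. SQLite has no NEWSEQUENTIALID(), so the sketch fakes "sequential" values with `uuid.UUID(int=n)`; the table names `parent`, `child` and `key_map` are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE parent (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE child (
    id INTEGER PRIMARY KEY,
    parent_id TEXT REFERENCES parent(id) ON UPDATE CASCADE
);
CREATE TABLE key_map (old_id TEXT PRIMARY KEY, new_id TEXT UNIQUE);
""")

# Existing data keyed by random GUIDs.
old_ids = [str(uuid.uuid4()) for _ in range(3)]
conn.executemany(
    "INSERT INTO parent VALUES (?, ?)",
    [(key, f"row {i}") for i, key in enumerate(old_ids)],
)
conn.executemany("INSERT INTO child (parent_id) VALUES (?)", [(k,) for k in old_ids])

# Step 1: fill the mapping table. In SQL Server the new_id column would
# instead be filled by a NEWSEQUENTIALID() default.
for n, old in enumerate(old_ids, start=1):
    conn.execute("INSERT INTO key_map VALUES (?, ?)", (old, str(uuid.UUID(int=n))))

# Step 2: swap the primary keys; ON UPDATE CASCADE rewrites child.parent_id.
conn.execute(
    "UPDATE parent SET id = (SELECT new_id FROM key_map WHERE old_id = parent.id)"
)

print(conn.execute("SELECT id FROM parent ORDER BY id").fetchall())
print(conn.execute("SELECT parent_id FROM child ORDER BY parent_id").fetchall())
```

On a real database you would want to run this per table inside a transaction and keep the mapping table around until every referencing table has been checked.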
How do you know it's bad for performance?
The advantage of GUIDs is that you don't have to worry about multiple processes creating records at the same time. It can simplify your code considerably, depending on the program.
SequentialGuids are best for performance when the Guid is the PK. This is the reason behind their existence.
In tables where you need only one column as the key, and the values in that column can be integers, when shouldn't you use an identity field?
Conversely, for the same table and column, when would you generate the values manually rather than use an autogenerated value for each record?
I guess that would be the case when there are lots of inserts and deletes on the table. Am I right? What other situations could there be?
If you have already settled on the surrogate side of the Great Primary Key Debacle, then I can't find a single reason not to use identity keys. The usual alternatives are GUIDs (they have many disadvantages, primarily from size and randomness) and application-layer generated keys. But creating a surrogate key in the application layer is a little harder than it seems and also does not cover non-application data access (i.e. batch loads, imports, other apps, etc.). The one special case is distributed applications, where GUIDs and even sequential GUIDs may offer a better alternative to site id + identity keys.
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent duplicate keys from colliding.
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
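A tiny sketch of why merging is easier with GUIDs, using plain Python dictionaries as stand-ins for two site tables (the site names and rows are made up):

```python
import uuid

# Two sites that each number their rows with a local identity counter.
site_a = {1: "alice", 2: "bob"}
site_b = {1: "carol", 2: "dave"}

# Merging by integer key: both sites used 1 and 2, so rows collide
# and site B silently replaces site A.
merged_by_int = {**site_a, **site_b}
print(len(merged_by_int))

# The same rows keyed by GUIDs generated independently at each site.
guid_a = {uuid.uuid4(): name for name in site_a.values()}
guid_b = {uuid.uuid4(): name for name in site_b.values()}
merged_by_guid = {**guid_a, **guid_b}
print(len(merged_by_guid))
```

With identity keys you would have to regenerate keys (and fix every foreign key pointing at them) or add a site id to the key before merging.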
One case of not wanting an identity field would be in a one to one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
You cannot (normally) specify values when inserting into identity columns, so for example if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
To perform this sort of insert, you need to set IDENTITY_INSERT ON for that table. It is not intended to be left on, and within a session only one table can have it ON at any time.
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception, which is snapshot configuration tables which I am tracking with an audit trigger. In this case, there is usually a logical "primary key" (usually date of the snapshot and natural key of the row - like a cost center or gl account number for which the row is a configuration record), but instead of using the natural "primary key" as the primary key, I add an IDENTITY and make that the primary key and make a unique index or constraint on the date and natural key. Although theoretically the date and natural key shouldn't change, in these tables, if a user does that instead of adding a new row and deleting the old row, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of a key and the appearance of a new one.
I recently implemented a Suffix Trie in C# that could index novels, and then allow searches to be done extremely fast, linear to the size of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL, and needed a structure to represent a Node in a table.
I ended up with the following structure : NodeID Character ParentID, etc, where the NodeID was a primary key.
I didn't want this to be done as an autoincrementing identity for two main reasons.
- How do I get the value of a NodeID after I add it to the database/data table?
- I wanted more control when it came to generating my own IDs.
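On the first point, most database APIs can hand back the key they just generated. The sketch below shows this with Python and SQLite, where the cursor exposes `lastrowid`; in SQL Server the analogous tools are SCOPE_IDENTITY() or an OUTPUT clause. The `node` table is a guess at the structure described above, and `add_node` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node ("
    " node_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " character TEXT NOT NULL,"
    " parent_id INTEGER REFERENCES node(node_id))"
)


def add_node(character, parent_id=None):
    """Insert a trie node and return the key the database generated for it."""
    cur = conn.execute(
        "INSERT INTO node (character, parent_id) VALUES (?, ?)",
        (character, parent_id),
    )
    return cur.lastrowid


root = add_node("")
a = add_node("a", root)   # child of the root
b = add_node("b", a)      # child of "a"
print(root, a, b)
```

Each insert returns the new NodeID immediately, so it can be used as the ParentID of the next node without a separate lookup.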