In our application, we create a lot of rows in a single table, based on some calculations. Due to the volume of rows, we use the BULK INSERT from within our .Net application, to write the rows quickly.
But we need to know which IDs were written in that BULK INSERT call. So the idea is to generate a GUID and add it to each row being written in the bulk insert, so that the GUID gets persisted in the table.
If we need to see what rows were written, we can SELECT .. FROM TABLE ... WHERE SessionID = the guid we generated.
I'd have a column on the table called SessionID (for example), VARCHAR(50) NOT NULL, Indexed.
Is this acceptable design?
You should create the column with type 'uniqueidentifier', which is intended for storing GUID values. Internally it will store it as a 16 byte (128 bit) integer instead of the much slower character string that you want to use.
Performance should be very good because you are just comparing 16 byte values for building the index, which is a pretty quick operation.
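A minimal sketch of the batch-tagging pattern, using Python's built-in sqlite3 as a stand-in for SQL Server (the table name Results, the Value column and the index name are hypothetical). SQLite has no uniqueidentifier type, so the GUID is stored here as its 16 raw bytes in a BLOB column, which mirrors the 16-byte storage described above:

```python
import sqlite3
import uuid

# Hypothetical stand-in table; BLOB plays the role of uniqueidentifier here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Results ("
    "Id INTEGER PRIMARY KEY, SessionID BLOB NOT NULL, Value REAL)"
)
conn.execute("CREATE INDEX IX_Results_SessionID ON Results (SessionID)")

# One GUID per batch, attached to every row written in that batch.
session = uuid.uuid4()
batch = [(session.bytes, v) for v in (1.5, 2.5, 3.5)]
conn.executemany("INSERT INTO Results (SessionID, Value) VALUES (?, ?)", batch)

# A row written by some other batch, to show that the filter excludes it.
other_session = uuid.uuid4()
conn.execute(
    "INSERT INTO Results (SessionID, Value) VALUES (?, ?)",
    (other_session.bytes, 9.9),
)

# Recover the IDs written by our batch.
written_ids = [
    row[0]
    for row in conn.execute(
        "SELECT Id FROM Results WHERE SessionID = ? ORDER BY Id",
        (session.bytes,),
    )
]
print(written_ids)

# Size of the binary form versus the usual text form of the same GUID.
binary_size = len(session.bytes)
text_size = len(str(session))
print(binary_size, text_size)
```

The binary form is 16 bytes while the hyphenated text form is 36 characters, which is the storage difference the answer is pointing at.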
I came across a sql code, which creates primary keys with hashbytes function and md5 algorithm. The code looks like this:
SELECT
CONVERT(VARBINARY(32),
CONVERT( CHAR(32),
HASHBYTES('MD5',
(LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
),
2)
)
FROM database.schema.table
I find it hard to understand why the result of the hashbytes function is converted to char and then to varbinary, when hashbytes already returns varbinary directly. Is there any good reason to do so?
Short Version
This code pads a hash with 0x20 bytes which is rather strange and most likely due to misunderstandings by the initial author. Using hashes as keys is a terrible idea anyway
Long Version
Hashes are completely inappropriate for generating primary keys. Since the same hash can be generated from different original data, this code can produce duplicate values, causing key collisions at best.
Worst case, you end up updating or deleting the wrong row, resulting in data loss. In fact, given that MD5 was broken over 20 years ago, one can calculate the values that would result in collisions. This has been used to hack systems in the past and even generate rogue CA certificates as far back as 2008.
And even worse, the concatenation expression :
(LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
Will create the same initial string for multiple different column values.
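To make this concrete, here is a small sketch that mimics the concatenation expression in Python (key_string is a hypothetical helper name; None stands in for NULL, and only spaces are trimmed, as LTRIM/RTRIM do):

```python
def key_string(column1, column2):
    """Mimic LTRIM(RTRIM(COALESCE(col, ''))) + ';' + LTRIM(RTRIM(COALESCE(col, '')))."""
    c1 = ("" if column1 is None else column1).strip(" ")
    c2 = ("" if column2 is None else column2).strip(" ")
    return c1 + ";" + c2


# A separator that also appears in the data makes different pairs indistinguishable.
first = key_string("a;b", "c")
second = key_string("a", "b;c")
print(first, second)

# NULL, the empty string and an all-spaces value all collapse to the same input.
from_null = key_string(None, "x")
from_empty = key_string("", "x")
from_spaces = key_string("   ", "x")
print(from_null, from_empty, from_spaces)
```

Because the strings are identical before hashing, the resulting keys are identical regardless of which hash function is used.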
On top of that, given the random nature of hash values, this results in table fragmentation and an index that can't be used for range queries. Primary keys most of the time are clustered keys as well, which means they specify the order rows are stored on disk. Using essentially random values for a PK means new rows can be added at the middle or even the start of a table's data pages.
This also harms caching, as data is loaded from disk in pages. With a meaningful clustered key, it's highly likely that loading a specific row will also load rows that will be needed very soon. Loading eg 50 rows while paging may only need to load a single page. With an essentially random key, you could end up loading 50 pages.
Using a GUID generated with NEWID() would provide a key value without collisions. Using NEWSEQUENTIALID() would generate sequential GUID values eliminating fragmentation and once again allowing range searches.
An even better solution would be to just create a PK from the two columns :
ALTER TABLE ThatTable ADD PRIMARY KEY (Column1,Column2);
Or just add an IDENTITY-generated ID column. A bigint is large enough to handle all scenarios :
CREATE TABLE ThatTable (
ID bigint NOT NULL IDENTITY(1,1) PRIMARY KEY,
...
)
If the intention was to ignore spaces in column values there are better options:
The easiest solution would be to clean up the values when inserting them.
A CHECK constraint can be added to each column to ensure the columns can't have leading or trailing spaces.
An INSTEAD OF trigger can be used to trim them.
Computed, persisted columns can be added that trim the originals, eg Column1_Cleaned as TRIM(Column1) PERSISTED. Persisted columns can be used in indexes and primary keys
As for what it does:
It generates deprecation warnings (MD5 is deprecated)
It pads the MD5 hash with 0x20 bytes. A rather ... unusual way of padding data. I suspect whoever first wrote this wanted to pad the hash to 32 bytes but used some copy-pasta code without understanding the implications.
You can check the results by hashing any value. The following queries
select hashbytes('md5','banana')
----------------------------------
0x72B302BF297A228A75730123EFEF7C41
select cast(hashbytes('md5','banana') as char(32))
--------------------------------
r³¿)z"Šus#ïï|A
A space in ASCII is the byte 0x20. Casting to binary replaces spaces with 0x20, not 0x00
select cast(cast(hashbytes('md5','banana') as char(32)) as varbinary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4120202020202020202020202020202020
If one wanted to pad a 16-byte value to 32 bytes, it would make more sense to use 0x00. The result is no better than the original though
select cast(hashbytes('md5','banana') as binary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4100000000000000000000000000000000
To get a real 32-byte hash, SHA2_256 can be used :
select hashbytes('sha2_256','banana')
------------------------------------------------------------------
0xB493D48364AFE44D11C0165CF470A4164D1E2609911EF998BE868D46ADE3DE4E
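The same padding behaviour can be reproduced outside SQL Server. This is only a sketch in Python with hashlib, assuming the varchar literal 'banana' is hashed as its single-byte encoding; the printed hex values can be compared with the query results above:

```python
import hashlib

digest = hashlib.md5(b"banana").digest()         # 16 raw bytes
space_padded = digest.ljust(32, b" ")            # what a CHAR(32) round trip does: pad with 0x20
zero_padded = digest.ljust(32, b"\x00")          # what a cast to binary(32) does: pad with 0x00
sha256_digest = hashlib.sha256(b"banana").digest()  # a real 32-byte hash

for label, value in [
    ("md5", digest),
    ("space padded", space_padded),
    ("zero padded", zero_padded),
    ("sha2_256", sha256_digest),
]:
    print(f"{label:>12}: 0x{value.hex().upper()} ({len(value)} bytes)")
```

Both padded values are 32 bytes long but carry no more information than the 16-byte MD5 digest; only the SHA-256 digest has 32 bytes of hash output.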
My question is about performance on SQL server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has 2 columns, and doing a SELECT * ...
So basically, will the existence of the extra columns slow down the select query even if I am not retrieving the data from the extra columns?
There will be a minor difference at the very end of the process, as you don't have to pass the rest of the information to the end client (either SSMS or another app).
When performing a read based on the clustered index, all of the columns (except BLOBs) are stored on the same set of pages, so to read the data you have to access the same set of pages anyway.
You would see a performance increase if you had a nonclustered index on the column list you are after, because those columns are then stored in their own structure of data pages (so there would be less to read).
Assuming that in both scenarios you are using the default clustered index that SQL Server creates when you define the primary key on the table, then no, there shouldn't be any performance difference between the two. It may be worth generating an actual execution plan to see for yourself. Actually, I'm not sure the above is true: since this is rowstore, the wider table won't fit as many rows onto each page, so it will incur more I/O overhead when reading data.
I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES User(Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUser) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
@offset INT,
@limit INT,
@input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows, and gets me good results when using few userids as input, but very bad (30+ seconds) results when providing hundreds or a couple thousand userids as input. The main problem seems to be that it uses 63% of the effort on sorting.
What index am I missing? This seems to be a pretty straightforward query on a single table.
What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.
Your best bet is to:
Add an INT IDENTITY column to the User table, named UserID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering the "Code" values (meaning: you cannot guarantee they will always use all upper-case or all lower-case), then it is best to add an AFTER INSERT, UPDATE Trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This also means that you need to make sure that all incoming queries use the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
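The size difference can be checked with a little arithmetic. This is a rough sketch only: it uses the XXX-Y-ZZZZZZZZZZ pattern from the comment as a sample value, counts a single copy of the column, and ignores per-row and variable-length overhead as well as the additional copies held in each index:

```python
sample_code = "XXX-Y-ZZZZZZZZZZ"                  # pattern quoted from the O.P.'s comment
nvarchar_bytes = len(sample_code.encode("utf-16-le"))  # NVARCHAR stores 2 bytes per character here
int_bytes = 4
row_count = 13_000_000                            # table size given in the question

extra_bytes = (nvarchar_bytes - int_bytes) * row_count
print(nvarchar_bytes, "bytes per value as NVARCHAR")
print(extra_bytes / 1_000_000, "MB extra for one copy of the column")
```

Since the column is also part of the primary key and of the nonclustered indexes, the real difference on disk and in the Buffer Pool would be a multiple of this figure.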
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
So I finally found a solution.
While @srutzky had good suggestions of normalizing the tables by changing the NVARCHAR UserId to an Integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
@Paparazzi suggested I add an index for (RelatedStory, CreationTime), and that did not do what I needed either. The reason was that I also needed to index RelatedUser, as that's the way the query goes, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly 1-second or a couple of seconds querytimes.
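The effect of putting the sort column into the index key can be illustrated with SQLite through Python's sqlite3 module. This is a different engine with a different optimizer, it has no INCLUDE columns, and the query below filters on a single user rather than a list, so treat it only as a sketch of the general principle that key columns, in order, are what let an index satisfy ORDER BY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Related ("
    "RelatedUser TEXT NOT NULL, "
    "RelatedStory INTEGER NOT NULL, "
    "CreationTime TEXT NOT NULL)"
)

QUERY = (
    "SELECT RelatedStory FROM Related "
    "WHERE RelatedUser = ? ORDER BY CreationTime DESC"
)


def plan(connection):
    """Return the query plan's detail column as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("u1",)).fetchall()
    return " | ".join(row[3] for row in rows)


# Without a suitable index the engine has to sort the matching rows itself.
plan_before = plan(conn)

conn.execute(
    "CREATE INDEX i_idandtime "
    "ON Related (RelatedUser, CreationTime DESC, RelatedStory)"
)

# With the filter column first and the sort column second, rows come out already ordered.
plan_after = plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

In the first plan SQLite reports a temporary B-tree for the ORDER BY; in the second it reads the index in key order and the separate sort step disappears.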
I think what gave me the revelation was @srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my groupby and orderby columns in the index.
So while I can't mark either of the above posters post as the Answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on
sorting.
ORDER BY CreationTime DESC
I would suggest an index on CreationTime
Or try an index on RelatedStory, CreationTime
I was trying to create an ID column in SQL server, VB.net that would provide a sequence of numbers for every new row created in a database. So I used the following technique to create the ID column.
select * from T_Users
ALTER TABLE T_Users
ADD User_ID INT NOT NULL IDENTITY(1,1) Primary Key
Then I registered a few usernames into the database and it worked just fine. For example, the first six rows would be 1,2,3,4,5,6. Then I registered 4 more users the next day, but this time the ID numbers jumped from 6 to a very large number: 1,2,3,4,5,6,1002,1003,1004,1005. Two days later, I registered two more users and the new rows read 3002,3004. So my question is: why is it skipping such a large range of numbers every other day I register users? Is the technique I used to create the sequence wrong? If it is wrong, can anyone please tell me how to do it right?
Out of frustration with the technique above, I also tried using sequentially generated GUID values. The GUID values were generated fine. The only downside is that they are very long (4 times the INT size). My question here is: does using GUID have any significant advantage over INT?
Regards,
Upside of GUIDs:
GUIDs are good if you ever want offline clients to be able to create new records, as you will never get a primary key clash when the new records are synchronised back to the main database.
Downside of GUIDs:
GUIDS as primary keys can have an effect on the performance of the DB, because for a clustered primary key, the DB will want to keep the rows in order of the key values. But this means a lot of inserts between existing records, because the GUIDs will be random.
Using IDENTITY column doesn't suffer from this because the next record is guaranteed to have the highest value and so the row is just tacked on the end every time. No re-shuffle needs to happen.
There is a compromise which is to generate a pseudo-GUID which means you would expect a key clash every 70 years or so, but helps the indexing immensely.
The other downsides are that a) they do take up more storage space, and b) are a real pain to write SQL against, i.e. much easier to type UPDATE TABLE SET FIELD = 'value' where KEY = 50003 than UPDATE TABLE SET FIELD = 'value' where KEY = '{F820094C-A2A2-49cb-BDA7-549543BB4B2C}'
Your declaration of the IDENTITY column looks fine to me. The gaps in your key values are probably due to failed attempts to add a row. The IDENTITY value will be incremented but the row never gets committed. Don't let it bother you, it happens in practically every table.
EDIT:
This question covers what I was meaning by pseudo-GUID. INSERTs with sequential GUID key on clustered index not significantly faster
In SQL Server 2005+ you can use NEWSEQUENTIALID() to get a random value that is supposed to be greater than the previous ones. See here for more info http://technet.microsoft.com/en-us/library/ms189786%28v=sql.90%29.aspx
Is the technique I used to create the sequence wrong?
No. If anything your google skills are non-existing. A short look for "Sql server identity skipping values" will give you a TON of returns including:
SQL Server 2012 column identity increment jumping from 6 to 1000+ on 7th entry
and the canonical:
Why are there gaps in my IDENTITY column values?
You are basically assuming, wrongly, that SQL Server will not optimize its access for performance. Identity numbers are markers, nothing else; please don't assume they have no gaps.
In particular: SQL Server preallocates numbers in blocks of 1000, and if you restart the server (as happens on your workstation) the remainder is lost.
http://www.sqlserver-training.com/sequence-breaks-gap-in-numbers-after-restart-sql-server-gap-between-numbers-after-restarting-server/-
If you use a manual sequence instead (new in SQL Server 2012), you can define the cache size for this pregeneration and set it to 1, at the cost of slightly lower performance when you do a lot of inserts.
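The preallocation behaviour can be pictured with a toy model. The class below is purely hypothetical and is not how SQL Server implements identity caching (the exact values it produces after a restart differ from the ones in the question); it only shows why losing the unused part of a reserved block leaves a gap, and why a cache size of 1 does not:

```python
class CachedCounter:
    """Toy model of a number generator that reserves values in blocks."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.persisted_high = 0  # highest value reserved durably ("on disk")
        self.next_value = 1      # current position, kept in memory only

    def next(self):
        if self.next_value > self.persisted_high:
            # Current block is used up: durably reserve another block.
            self.persisted_high += self.cache_size
        value = self.next_value
        self.next_value += 1
        return value

    def restart(self):
        # The in-memory position is lost; resume after the last reserved block.
        self.next_value = self.persisted_high + 1


def simulate(cache_size):
    counter = CachedCounter(cache_size)
    ids = [counter.next() for _ in range(6)]
    counter.restart()
    ids += [counter.next() for _ in range(4)]
    return ids


cached_ids = simulate(cache_size=1000)
uncached_ids = simulate(cache_size=1)
print(cached_ids)
print(uncached_ids)
```

With a block size of 1000 the values after the restart start beyond the first reserved block; with a block size of 1 every value is reserved individually, so nothing is lost, at the price of a durable write per value.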
My question here is does using GUID have any significant advantage over INT?
Yes. You can have a lot more rows with GUIDs than with int. For example, a 32-bit int is limited to about 2 billion rows. For some of us that is too low (I have tables in the 10 billion range), and even a 64-bit integer is limited. In a truly zettabyte-scale database, you have to use a GUID in sequence, self-generated.
Any normal human does not see a difference as we all do not really deal with that many rows. And the larger size makes a lot of things slower (larger key size = larger space in indices = larger indices = more memory / io for the same operation). Plus even your sequential id will jump.
Why not just adjust your expectation to reality - identity is not meant to be without gaps - or use a sequence with cache 1.
I have a purely academic question about SQLite databases.
I am using SQLite.net to use a database in my WinForm project, and as I was setting up a new table, I got to thinking about the maximum values of an ID column.
I use the IDENTITY for my [ID] column, which according to SQLite.net DataType Mappings, is equivalent to DbType.Int64. I normally start my ID columns at zero (with that row as a test record) and have the database auto-increment.
The maximum value (Int64.MaxValue) is 9,223,372,036,854,775,807. For my purposes, I'll never even scratch the surface on reaching that maximum, but what happens in a database that does? While trying to read up on this, I found that DB2 apparently "wraps" the value around to the negative value (-9,223,372,036,854,775,807) and increments from there, until the database can't insert rows because the ID column has to be unique.
Is this what happens in SQLite and/or other database engines?
I doubt anybody knows for sure, because if a million rows per second were being inserted, it would take about 292,471 years to reach the wrap-around-risk point -- and databases have been around for a tiny fraction of that time (actually, so has Homo Sapiens;-).
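The figure above can be reproduced with integer arithmetic, assuming 365-day years and a constant rate of one million inserts per second:

```python
MAX_ID = 2**63 - 1                     # Int64.MaxValue
INSERTS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 365-day year, no leap days

seconds_needed = MAX_ID // INSERTS_PER_SECOND
years_needed = seconds_needed // SECONDS_PER_YEAR
print(years_needed)
```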
IDENTITY is not actually the proper way to auto-increment in SQLite. That will require you do the incrementing in the app layer. In the SQLite shell, try:
create table bar (id IDENTITY, name VARCHAR);
insert into bar (name) values ('John');
select * from bar;
You will see that id is simply null. SQLite does not give any special significance to IDENTITY, so it is basically an ordinary (untyped) column.
On the other hand, if you do:
create table baz (id INTEGER PRIMARY KEY, name VARCHAR);
insert into baz (name) values ('John');
select * from baz;
it will be 1 as I think you expect.
Note that there is also a INTEGER PRIMARY KEY AUTOINCREMENT. The basic difference is that AUTOINCREMENT ensures keys are never reused. So if you remove John, 1 will never be reused as a id. Either way, if you use PRIMARY KEY (with optional AUTOINCREMENT) and run out of ids, SQLite is supposed to fail with SQLITE_FULL, not wrap around.
By using IDENTITY, you do open the (probably irrelevant) possibility that your app will incorrectly wrap around if the db were ever full. This is quite possible, because IDENTITY columns in SQLite can hold any value (including negative ints). Again, try:
insert into bar VALUES ('What the hell', 'Bill');
insert into bar VALUES (-9, 'Mary');
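Both of those are completely valid for bar. The second would be valid for baz too, but the first would fail there with a datatype mismatch, because an INTEGER PRIMARY KEY column only accepts integers. With baz you can also avoid manually specifying id. That way, there will never be junk in your id column.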
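The bar and baz experiments can also be run from Python's built-in sqlite3 module instead of the SQLite shell. This sketch simply repeats the statements above against an in-memory database and collects the id values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (id IDENTITY, name VARCHAR)")
conn.execute("CREATE TABLE baz (id INTEGER PRIMARY KEY, name VARCHAR)")

conn.execute("INSERT INTO bar (name) VALUES ('John')")
conn.execute("INSERT INTO baz (name) VALUES ('John')")

bar_first_id = conn.execute("SELECT id FROM bar").fetchone()[0]
baz_first_id = conn.execute("SELECT id FROM baz").fetchone()[0]
print(bar_first_id, baz_first_id)

# bar's id column accepts arbitrary values, including text and negative numbers.
conn.execute("INSERT INTO bar VALUES ('What the hell', 'Bill')")
conn.execute("INSERT INTO bar VALUES (-9, 'Mary')")
bar_ids = [row[0] for row in conn.execute("SELECT id FROM bar ORDER BY rowid")]
print(bar_ids)
```

The id in bar comes back as None (NULL) because nothing generates it, while baz assigns 1 automatically.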
The documentation at http://www.sqlite.org/autoinc.html indicates that the ROWID will try to find an unused value via randomization once it reached its maximum number.
For AUTOINCREMENT it will fail with SQLITE_FULL on all attempts to insert into this table, once there was a maximum value in the table:
If the table has previously held a row with the largest possible ROWID, then new INSERTs are not allowed and any attempt to insert a new row will fail with an SQLITE_FULL error.
This is necessary, as the AUTOINCREMENT guarantees that the ID is monotonically increasing.
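Both behaviours can be observed without inserting billions of rows by explicitly inserting the largest possible ROWID first. The following is a sketch using Python's sqlite3 module; the table names plain and strict_inc are made up for the example:

```python
import sqlite3

MAX_ROWID = 2**63 - 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE strict_inc (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

# Put a row with the largest possible ROWID into both tables.
for table in ("plain", "strict_inc"):
    conn.execute(
        f"INSERT INTO {table} (id, name) VALUES (?, 'last')", (MAX_ROWID,)
    )

# Without AUTOINCREMENT, SQLite looks for an unused ROWID instead of failing.
cursor = conn.execute("INSERT INTO plain (name) VALUES ('next')")
random_id = cursor.lastrowid
print("plain table picked id", random_id)

# With AUTOINCREMENT, the insert is refused once the maximum has been used.
try:
    conn.execute("INSERT INTO strict_inc (name) VALUES ('next')")
    autoinc_error = None
except sqlite3.Error as exc:
    autoinc_error = str(exc)
print("AUTOINCREMENT table error:", autoinc_error)
```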
I can't speak to any specific DB2 implementation logic, but the "wrap around" behavior you describe is standard for numbers that implement signing via two's complement.
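Python integers do not overflow, so this small sketch uses ctypes to force a value into a signed 64-bit integer and show the two's complement wrap-around. It says nothing about what DB2 or any other engine actually does:

```python
import ctypes

INT64_MAX = 2**63 - 1

# c_int64 performs no overflow checking; it keeps only the low 64 bits.
wrapped = ctypes.c_int64(INT64_MAX + 1).value
print(INT64_MAX, "+ 1 ->", wrapped)
```

Adding one to the largest positive value lands on the most negative value, which is the wrap-around being described.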
As for what would actually happen, that's completely up in the air as to how the database would handle it. The issue arises at the point in time of actually CREATING the id that's too large for the field, as it's unlikely that the engine internally uses a data type of more than 64 bits. At that point, it's anybody's guess...the internal language used to develop the engine could throw up, the number could silently wrap around and just cause a primary key violation (assuming that a conflicting ID existed), the world could come to an end due to your overflow, etc.
But pragmatically, Alex is correct. The theoretical limit on the number of rows involved here (assuming it's one id per row and not any sort of cheater identity insert shenanigans) would basically render the situation moot, as by the time that you could conceivably enter that many rows at even a stupendous insertion rate we'll all be dead anyway, so it doesn't matter :)