I have a table with 3 columns:
PersonId uniqueidentifier -- key
DeviceId uniqueidentifier -- key
Counter bigint
The counter values arrive in ascending order but sometimes have gaps. An example of the counter values is (1, 2, 3, 1000, 10000, 10001, 10002, ...). The counter values are saved one at a time. If I insert one row per counter value, the table gets big very fast. I only need to keep the last 1000 counter values and can delete earlier ones.
Is it possible to concatenate the counter values into one or a few rows of type varbinary(8000), and remove the early values at the beginning of the binary as part of the insert operation? I would like help writing this query. I prefer not to use varchar because each character would take up 2 bytes. There may be a better way than I can envision. Any help is appreciated!
Why are you trying to do that versus using a table with PersonID, DeviceID, Counter, and allow only a certain number of Counters per PersonID and DeviceID pair?
If your goal is to save space, remember that varbinary(8000) caps you at 8000 bytes, which is at most 1000 bigint values, regardless of how many counters a given pair actually has.
How likely is it that most of these PersonID and DeviceID pairs will have 1000 counters?
In the end, you are just making it more complicated for yourself, and harder for future employees to maintain; are you really saving space?
You are also going to have to add some transaction handling, which will take up more of your server's resources.
But to answer your question strictly: yes, it is possible. I guess a trigger or a process at the end of a sproc could handle what you are trying to do.
Maybe you can choose an in-between solution: make each row hold 10 or so columns of type bigint. That cuts the number of rows, and the per-row overhead, by a factor of ten. Maybe that is enough for you.
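If you do go with the plain three-column table and a cap, here is a rough sketch of the insert-and-trim step mentioned above (assuming the table is named dbo.CounterLog and this runs inside a procedure that receives @PersonId, @DeviceId and @Counter; all of those names are my own placeholders):

INSERT INTO dbo.CounterLog (PersonId, DeviceId, Counter)
VALUES (@PersonId, @DeviceId, @Counter);

-- keep only the most recent 1000 counter values for this pair
WITH Ranked AS
(
    SELECT Counter,
           ROW_NUMBER() OVER (ORDER BY Counter DESC) AS rn
    FROM dbo.CounterLog
    WHERE PersonId = @PersonId
      AND DeviceId = @DeviceId
)
DELETE FROM Ranked
WHERE rn > 1000;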
We are being given a range of numbers which need to be used on some of our orders.
We must ensure that each number is allocated only once.
The range of numbers may be more than a million items (not sure at this stage how many digits but assuming less than 10).
I was thinking of prepopulating a table with these numbers so that every time one is used we can update that row with perhaps the orderid.
Table would be:
Id int, Code int, OrderId uniqueidentifier, Date datetime
When getting the next number to allocate to an order, we would find the MIN(Code) where OrderId is NULL.
Id would be the primary key, but the clustered index would be on the Code column, as that is what the most important query will use.
Does this sound like a good way to go about doing this? Is there a better way?
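For concreteness, here is a rough sketch of how that allocation could be done atomically (assuming the table is named dbo.OrderCode; the names are placeholders), so that finding MIN(Code) and stamping the OrderId happen in a single statement rather than a separate read and update:

DECLARE @OrderId uniqueidentifier = NEWID();   -- the order being assigned a code (would normally be passed in)
DECLARE @Allocated TABLE (Code int);

WITH NextCode AS
(
    SELECT TOP (1) Code, OrderId, [Date]
    FROM dbo.OrderCode WITH (UPDLOCK, READPAST)
    WHERE OrderId IS NULL
    ORDER BY Code
)
UPDATE NextCode
SET OrderId = @OrderId,
    [Date]  = GETUTCDATE()
OUTPUT inserted.Code INTO @Allocated;

SELECT Code FROM @Allocated;   -- the number just allocated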
I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES [User](Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUser) INCLUDE (RelatedStory, CreationTime);

CREATE INDEX i_relatedstory
ON Related (RelatedStory) INCLUDE (RelatedUser, CreationTime);
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
@offset INT,
@limit INT,
@input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows, and I get good results when using a few userids as input, but very bad results (30+ seconds) when providing hundreds or a couple of thousand userids. The main problem seems to be that 63% of the effort is spent on sorting.
What index am I missing? This seems to be a pretty straightforward query on a single table.
What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR, which will take up half the space and sort / compare faster. Also, still use a binary collation.
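For illustration, the syntax for that kind of change is roughly the following (a sketch only: any primary key, foreign key, or index referencing the column has to be dropped and recreated around the ALTER, and the referencing columns must match the new type and collation):

ALTER TABLE dbo.Related
    ALTER COLUMN RelatedUser VARCHAR(100) COLLATE Latin1_General_100_BIN2 NOT NULL;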
Your best bet is to (a rough T-SQL sketch follows this list):
Add an INT IDENTITY column to the User table, named UserID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (e.g. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering the "Code" values (meaning: you cannot guarantee they will always use all upper-case or all lower-case), then it is best to add an AFTER INSERT, UPDATE trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This will also mean that you need to make sure that all incoming queries use the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
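A rough sketch of those steps in T-SQL (constraint and index names here are my own guesses; swapping the clustered PK on User, dropping the dependent constraints/indexes before removing RelatedUser, and the upper-casing trigger are all omitted for brevity):

-- 1) new surrogate key on User (eventually this should become the clustered PK)
ALTER TABLE dbo.[User] ADD UserID INT IDENTITY(1, 1) NOT NULL;
ALTER TABLE dbo.[User] ADD CONSTRAINT UQ_User_UserID UNIQUE (UserID);

-- 2) new FK column on Related, backfilled from the old NVARCHAR id
ALTER TABLE dbo.Related ADD UserID INT NULL;

UPDATE r
SET    r.UserID = u.UserID
FROM   dbo.Related AS r
JOIN   dbo.[User]  AS u ON u.Id = r.RelatedUser;

ALTER TABLE dbo.Related ALTER COLUMN UserID INT NOT NULL;
ALTER TABLE dbo.Related
    ADD CONSTRAINT FK_Related_User FOREIGN KEY (UserID) REFERENCES dbo.[User] (UserID);

-- 3) the old NVARCHAR id becomes an "alternate key"
EXEC sp_rename 'dbo.[User].Id', 'UserCode', 'COLUMN';
CREATE UNIQUE NONCLUSTERED INDEX UX_User_UserCode ON dbo.[User] (UserCode);

-- 4) recreate the TVP with the new datatype (the old type can only be dropped
--    after the procedures that reference it have been altered)
-- DROP TYPE dbo.UserIdInput;
CREATE TYPE dbo.UserIdInput AS TABLE (UID INT PRIMARY KEY CLUSTERED);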
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
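For example, a sketch against the GetStories query from the question:

SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY
OPTION (RECOMPILE);

-- or materialize the TVP into a local temp table, which does keep statistics,
-- and have the EXISTS reference #TempUserIDs instead of @input:
SELECT UID INTO #TempUserIDs FROM @input;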
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and gives the app layer a quick and easy one-time lookup to get the "internal" integer value for that row: the INT IDENTITY column acting as the actual UserID (it is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory, since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
So I finally found a solution.
While @srutzky had good suggestions about normalizing the tables by changing the NVARCHAR UserId to an integer to minimize comparison cost, this was not what solved my problem. I will definitely do it at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
@Paparazzi suggested I add an index for (RelatedStory, CreationTime), but that did not do what I needed either. The reason was that I also needed to index RelatedUser, since that is how the query filters, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly one-second, or at worst a couple of seconds, query times.
I think what gave me the revelation was @srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my GROUP BY and ORDER BY columns in the index.
So while I can't mark either of the above posters' posts as the Answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on
sorting.
ORDER BY CreationTime DESC
I would suggest an index on CreationTime.
Or try an index on RelatedStory, CreationTime
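For example (index names are my own):

CREATE INDEX i_creationtime ON Related (CreationTime DESC) INCLUDE (RelatedStory);
-- or
CREATE INDEX i_story_creationtime ON Related (RelatedStory, CreationTime DESC);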
Is it possible to force an IDENTITY column to recalculate its seed when it reaches the maximum value for the defined data type, so that it fills the gaps in the IDs?
Say, for example, I have a column of the TINYINT data type, which can hold values up to a maximum of 255. When the column is filled with data up to the maximum possible ID, I delete one row from the middle, say the one with ID = 100.
The question is: can I force the IDENTITY to fill that missing ID once it reaches the end?
You can reseed the IDENTITY (set a new seed), but it will NOT be able to magically find the missing values.....
The reseeded IDENTITY column will just keep handing out new values starting at the new seed - which means, at some point, sooner or later, collisions with already existing values will happen
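For reference, a reseed is just this (the table name is a placeholder); the next row inserted will get identity value 100, regardless of whether 100 is already taken:

DBCC CHECKIDENT ('dbo.YourTable', RESEED, 99);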
Therefore, all in all, reseeding an IDENTITY really isn't a very good idea .... Just pick a data type large enough to handle your needs.
With INT, starting at 1, you get over 2 billion possible rows - that should be more than sufficient for the vast majority of cases. With BIGINT, you get roughly 9.2 quintillion (9.2 x 10^18 - about 9,223,000,000 billion) - enough for you??
If you use an INT IDENTITY starting at 1, and you insert a row every second, you need 66.5 years before you hit the 2 billion limit ....
If you use a BIGINT IDENTITY starting at 1, and you insert one thousand rows every second, you need a mind-boggling 292 million years before you hit the 9.2 quintillion limit ....
Read more about it (with all the options there are) in the MSDN Books Online.
Yes, you can do this (the table name is a placeholder):
SET IDENTITY_INSERT dbo.YourTable ON;
INSERT INTO dbo.YourTable (id) VALUES (100);
-- set it back off when done
SET IDENTITY_INSERT dbo.YourTable OFF;
Identity Insert
Thanks to the wonderful article The Cost of GUIDs as Primary Keys, we have the COMB GUID. Based on current implementations, there are two approaches:
use the last 6 bytes for the timestamp: GUIDs as fast primary keys under multiple databases
use the last 8 bytes for the timestamp, using Windows ticks: GUID COMB strategy in EF4.1 (CodeFirst)
We all know that with a 6-byte timestamp in the GUID, more bytes are left random, which reduces the chance of GUID collisions. However, more GUIDs sharing the same timestamp will be created, and those are not sequential at all. From that point of view, an 8-byte timestamp would be preferred.
So it seems a hard choice. The first article above, GUIDs as fast primary keys under multiple databases, says:
Before we continue, a short footnote about this approach: using a 1-millisecond-resolution timestamp means that GUIDs generated very close together might have the same timestamp value, and so will not be sequential. This might be a common occurrence for some applications, and in fact I experimented with some alternate approaches, such as using a higher-resolution timer such as System.Diagnostics.Stopwatch, or combining the timestamp with a "counter" that would guarantee the sequence continued until the timestamp updated. However, during testing I found that this made no discernible difference at all, even when dozens or even hundreds of GUIDs were being generated within the same one-millisecond window. This is consistent with what Jimmy Nilsson encountered during his testing with COMBs as well
I just wonder if someone who knows database internals could shed some light on the observation above. Is it because the database server just keeps the data in memory and only writes it to disk when a certain threshold is reached? If so, the reordering of rows inserted with non-sequential GUIDs that share a timestamp would generally happen in memory, with minimal performance penalty.
Update:
Based on our testing, the COMB GUID did not reduce table fragmentation compared with a random GUID, contrary to what is claimed around the internet. It seems the only way right now is to have SQL Server generate the sequential GUID.
The article you referenced is from 2002 and is very old. Just use newsequentialid (available in SQL Server 2005 and up). This guarantees that each new id you generate is greater than the previous one, solving the index fragmentation/page split issue.
Another aspect I'd like to mention, though, that the writer of that article glossed over, is that using 16 bytes when you only need 4 is not a good idea. Let's say you have a table with 500,000 rows averaging 150 bytes not including the clustered column, and the table has 3 nonclustered indexes (which repeat the clustered column in each row), each in turn with rows averaging 4 bytes, 25 bytes, and 50 bytes not counting the clustered column.
The storage requirements at perfect 100% fill factor are then (all numbers in megabytes except where %):
Item     Clust     50     25      4   Total
----     -----  -----  -----  -----  ------
GUID      79.1   31.5   19.6    9.5   139.7
int       73.4   25.7   13.8    3.8   116.7
%imp      7.2%  18.4%  29.6%  60.0%   16.5%
In the nonclustered index having just one int column of 4 bytes (a common scenario), switching the clustered index to an int makes it 60% smaller! This translates directly into a 60% performance improvement for any scans on the table--and that's conservative, because with smaller rows, page splits will occur less often and the fragmentation will stay better longer.
Even in the clustered index itself, there's still a 7.2% performance improvement, which is not nothing, at all.
What if you used GUIDs throughout your entire database, which had tables with a similar profile to this one, where switching to int would yield a 16.5% reduction in size, and the database itself was 1.397 terabytes in size? Your whole database would be 230 GB larger (scale up the Total column: 139.7 vs. 116.7). That translates into real money in the real world for high-availability storage. It moves your disk purchase schedule earlier in time, which is harmful to your company's bottom line.
Do not use larger data types than necessary, ever. It's like adding weight to your car for no reason: you will pay for it (if not in speed, then in fuel economy).
UPDATE
Now that I know you are creating the GUID in your client-side code, I can see more clearly the nature of your problem. If you are able to defer creating the GUID until row insertion time, here's one way to accomplish that.
First, set a default for your CustomerID column:
ALTER TABLE dbo.Customer ADD CONSTRAINT DF_Customer_CustomerID
    DEFAULT (NEWSEQUENTIALID()) FOR CustomerID;
Now you don't have to specify what value to insert for CustomerID in any INSERT, and your query could look like this:
DECLARE @Name varchar(100) = 'Acme Spy Devices';
INSERT dbo.Customer (Name)
OUTPUT inserted.CustomerID -- a GUID
VALUES (@Name);
In this very simple example, you have inserted a new row to the Customer table, and returned a rowset to the client containing the just-created value, all in one query.
Note that you could not explicitly insert VALUES (NEWSEQUENTIALID(), @Name) instead: SQL Server only allows NEWSEQUENTIALID() in a DEFAULT constraint, so rely on the default as shown above.
I was given a ragtag assortment of data to analyze and am running into a predicament. I've got a ~2 million row table with a non-unique identifier of datatype varchar(50). This identifier is unique to a person (a personID). Until I figure out exactly how I need to normalize this junk, I've got another question that might help me right now: if I change the datatype to varchar(25), for instance, will that help queries run faster when they're joined on a non-PK field? All of the characters in the string are digits, but trying to convert them to an int would cause overflow. Or could I possibly index the column for the time being to get some of the queries to run faster?
EDIT: The personID will be a foreign key to another table with demographic information about a person.
Technically, the length of a varchar specifies its maximum length.
The actual stored length is variable (hence the name), so lowering the maximum won't change how comparisons are evaluated, because they are made on the actual string.
For more information:
Check this MSDN article and this Stack Overflow post.
Changing varchar(50) to varchar(25) would reduce the maximum size of a record in that table, thereby possibly reducing the number of database pages that contain the table and improving query performance (perhaps only to a marginal extent), but such an ALTER TABLE statement might take a long time.
Alternatively, if you define an index on the join columns, and your retrieval list is small, you can also include those columns in the index definition (a covering index); that too would bring query execution times down significantly.
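As a sketch of that covering-index idea (table and column names here are placeholders, since the actual schema wasn't posted):

CREATE NONCLUSTERED INDEX IX_PersonData_PersonID
    ON dbo.PersonData (PersonID)          -- the varchar(50) join column
    INCLUDE (FirstName, LastName);        -- the small retrieval list, covered by the index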