I'm using SQL Server 2012. Is it possible to generate a uniqueidentifier value from two or three input values, mostly varchars or decimals (any data type that holds 0-9 and a-z)?
Usually a uniqueidentifier varies from system to system. For my requirement I need a custom one: whenever I call this function with the same inputs, it should return the same value on every system.
I have been thinking of converting the values into varbinary, taking certain parts of it, and generating a uniqueidentifier from those. How good is this approach?
I'm still working on this approach.
Please share your suggestions.
What you describe is a hash of the values. Use the HASHBYTES function to digest your values into a hash. But note that your problem statement contradicts the requirement for uniqueness: by definition, reducing an input of size M to a hash of size N, where N < M, may generate collisions. If you truly need uniqueness, redefine the requirements in a manner that at least allows for it. Namely, the requirement that "it should get me the same value in all the systems" must be dropped, since the only way to guarantee uniqueness under it is to output exactly the input. If you remove that requirement instead, the remaining requirements are satisfied by NEWID() (yes, it does not consider the input, but it doesn't have to in order to meet them).
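For example, a minimal sketch of the hashing approach (the variable names and the '|' separator are placeholders, not part of any prescribed API):

declare @val1 varchar(50) = 'ABC123', @val2 decimal(10, 2) = 42.50  -- placeholder inputs
-- MD5 produces exactly 16 bytes, which cast cleanly to a uniqueidentifier;
-- the same inputs yield the same output on any system
select cast(hashbytes('MD5', concat(@val1, '|', @val2)) as uniqueidentifier)

The separator guards against different inputs concatenating to the same string (e.g. 'AB' + 'C' versus 'A' + 'BC').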
The standards document for UUIDs goes to some length showing how they are generated: http://www.ietf.org/rfc/rfc4122.txt
I would give this a read (especially section 4.1.2, which breaks down how a GUID should be generated). You could keep the timestamp components but hard-code the network-location element, which would give you what you are looking for.
Can anyone explain this to me, and maybe propose a better approach?
Why is checksum(0.0280) = checksum(-0.0280)?
Casting to float would solve it, but I'm reluctant to do that and would rather find a way around it.
Later edit: I was trying to keep things simple. As with most questions around here, this is something that came up in production, and posting the entire database structure would be overkill.
I will try to explain it a bit better. I have some dynamic-structure tables (dynamic in the sense that the end user controls the structure through a web application) with the following rough structure: Id (int), StartDate, FKey1 (nvarchar), Value1 (decimal or nvarchar or int), Value2 ... ValueN.
These tables can be filled (again, by the end user) with redundant data (millions of rows), and during some calculations I would like to declutter them, leaving only relevant information. The way to declutter is to remove consecutive identical rows (identical except for the date). For the sake of performance I wanted to avoid checking each column individually, so CHECKSUM came in handy because it also supports multiple columns as input.
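For illustration, a rough sketch of the decluttering step I have in mind, assuming a version with LAG (SQL Server 2012+) and using placeholder table and column names:

;with Ordered as (
    select Id,
           checksum(FKey1, Value1, Value2) as RowHash,
           lag(checksum(FKey1, Value1, Value2))
               over (partition by FKey1 order by StartDate) as PrevHash
    from DynamicTable
)
delete d
from DynamicTable d
join Ordered o on o.Id = d.Id
-- a hash matching the previous row marks a consecutive duplicate,
-- though, as discussed below, equal checksums do not guarantee equal rows
where o.RowHash = o.PrevHash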
If you were thinking that there is one and only one possible input value for every possible CHECKSUM result, you were mistaken.
From the documentation:
If at least one of the values in the expression list changes, the list checksum will probably change. However, this is not guaranteed. Therefore, to detect whether values have changed, we recommend use of CHECKSUM only if your application can tolerate an occasional missed change. Otherwise, consider using HashBytes instead. With a specified MD5 hash algorithm, the probability that HashBytes will return the same result for two different inputs is much lower than that of CHECKSUM.
If you want to research it further, you might Google CHECKSUM collisions.
With a hashing function (like CHECKSUM) there will always be the risk of collisions.
You can try another (slower) hash function (like HashBytes, as mentioned by @TabAlleman), or you can try a homemade variant that might perform better than HashBytes (this should be tested) and that better fits the kind of numbers you expect to come in. So this is a trade-off: performance versus collision risk. Here are two such homemade variants that will give different results for numbers that are equal except for their sign. Please notice that these variants will also produce collisions, but most likely for differences other than simply the sign.
select checksum(.028, floor(.028))
select checksum(-.028, floor(-.028))
select checksum(.028) + sign(.028)
select checksum(-.028) + sign(-.028)
When you said you could solve it by casting to a float but still did not want to, I wondered if that was out of performance considerations. If so, I'm not sure my variants will perform better than casting to a float. Have a go at measuring that yourself :-)
I am well aware that using an nvarchar field as a primary key or a foreign key adds some time and space overhead to the usage of the generated index in the majority (if not all) of cases.
As a general rule, numeric keys are a good idea, but under certain common circumstances (small sets of data, for instance) it isn't a problem to use text-based keys.
However, I am wondering if anyone can provide rigorous information on whether it is MORE efficient, or at least equally efficient, to use text for database keys rather than numeric values under certain circumstances.
Consider a case where a table contains a short list of records. For our example, we'll say we need 50 records. Each record needs an ID. You could use generic int (or even smallint) numbers (e.g. [1...50]), OR you could assign meaningful, 2-character values to a char(2) field (e.g. [AL, AK, AZ, AR, ... WI]).
In the above case, we could assume that using a char(2) field is potentially more efficient than using an int key, since the char data is 2 bytes versus the 4 bytes used by an int. A smallint field would theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.
The benefit of using the text-based key over the numeric key is that the values are readable, which should make it obvious to many that my list of 50 records is likely a list of US states.
As stated, using keys that are smaller than or equal in size to a comparable numeric key should be of similar efficiency. However, depending on the architecture and design of the database engine, it is possible that in-practice usage may yield unexpected results.
With that stated, is it ever more, equally, or less efficient to use any form of text-based value as a key within SQL Server?
I don't need obsessively thorough research results (though I wouldn't mind them), but I am looking for an answer that goes beyond stating what we would expect from a database.
Definitively, how does the efficiency of text-based keys compare to numeric keys as the size of the text key increases or decreases?
In most cases considerations driven by the business requirements (use cases) will far outweigh any performance differences between numeric v. text keys. Unless you are looking at very large and/or very high throughput systems your choice of primary key type should be based on how the keys will be used rather than any small difference in performance you will see between numeric and text keys.
Think in assembly to find out the answer. You stated this:
we could assume that using a char(2) field is potentially more efficient than using an int key, since the char data is 2 bytes versus the 4 bytes used by an int. A smallint field would theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.
This isn't true, as (to my knowledge) you can't move 2 characters simultaneously in a single instruction. So even though a char is smaller than a 4-byte int, you have to move the characters one by one into the register to do a comparison. To compare two instances of a 4-byte int, even though it is larger in size, you only need 1 move instruction per int (disregarding that you also need to move them out of the register back into memory).
So what happens if you use an int:
Move one of them into one register
Move the other into another
Do a comparison operation
Move to the appropriate memory location depending on the comparison result
In the case of a char, however:
Move one of them into one register
Move the other into another
Do a comparison
If you are lucky and the order can be determined, you're done, and the cost is the same as in the case of ints.
If they are equal, rinse and repeat with the subsequent characters until the order or equality can be determined. Obviously, this is more costly.
The point is that at this low level, the determining factor is not the data size but the number of instructions needed.
Apart from the low-level stuff:
Yes, there might be cases where it simply doesn't matter because of the small amount of data that is not likely to ever change: the chemical symbols of the elements, for example (though I am not sure whether I'd use them as PKs).
Generally, you don't use artificial PKs for time and space considerations, but because, having nothing to do with real-life facts, they are not subject to change. Can you imagine the name of a US state ever changing? I can. If it happens, you would have to update the record itself (if the abbreviation changes too, of course) and all other records that reference it. If you use an int key instead, your key will have nothing to do with what happens in reality, in which case you only have to update the abbreviation and the state name itself, and you can sit back assured that everything is consistent.
Comparing short strings is not always as trivial as comparing the numeric values of their binary representations. When you also have to consider internationalization, you need to rely on custom (or framework/platform-provided) logic to compare them. To use my language as an example, the letter 'Á' has a decimal value of 193, which is greater than the value 66 of the letter 'B', yet in the Hungarian alphabet 'Á' precedes 'B'.
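For instance, a quick check using one of SQL Server's built-in Hungarian collations:

-- binary code-point order would put 'Á' (193) after 'B' (66),
-- but the Hungarian collation sorts it first
select case when N'Á' < N'B' collate Hungarian_CI_AS
            then N'Á sorts before B'
            else N'B sorts before Á' end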
Using textual data rather than an artificial numeric PK can also cause some fragmentation, and write operations are likely to be slower. The reason is that an artificial, monotonically increasing numeric PK causes newly created rows to be inserted at the end of the table in all cases, thereby avoiding the need to move data around to free up space in between.
I am wondering about a basic database design / data type question.
I have a projects table with a field called "experience_required". I know this field will always be populated with one of these options: intern, junior, senior, or director. This list may vary a bit as time evolves, but I don't expect dramatic changes to the items on it.
Should I go for integer or string? In the future, when I have tons of records like this and need to retrieve them by experience_required, will it make a difference to have them as integers?
You will probably want this field indexed. Once indexed, an integer and a small char string have a negligible performance difference.
Definitely go for Integer over String.
Performance will be better, and your database will be closer to being normalized.
Ultimately, you should create a new table called ExperienceLevel, with fields Id and Title. The experience_required field in the existing table should be changed to a foreign key to the new table.
This will be a much stronger design, and it will be more forgiving if you change the available experience levels or decide to rename one.
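A minimal sketch of that design (the key column name on the projects side is only a suggestion):

create table ExperienceLevel (
    Id int identity(1, 1) primary key,
    Title nvarchar(50) not null unique  -- 'intern', 'junior', 'senior', 'director'
)

alter table projects
    add experience_level_id int
        constraint FK_projects_ExperienceLevel
        references ExperienceLevel (Id)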
You can read more about Normalization here.
Integers. IMHO, strings should only be used to store textual data (names, addresses, text, etc.).
Besides, in this case integers are better for sorting, storage space, and maintenance.
In theory integers will take less memory when you index them.
You can also use enums (in MySQL), which look like strings but are stored as integers.
Doesn't matter. The difference would be negligible. What difference there is would favor the choice of integer, but this is one of the few cases in which I prefer a short text key since it will save a JOIN back to a lookup table in many reporting situations.
To muddy the waters some, I'll suggest a mix. Start with @GregSansom's idea (upvoted), but instead of integers use the CHAR(1) datatype, with values I, J, S, and D. This will give you the same performance as using tinyint, with the extra advantage of a simple-to-remember mnemonic when (if) working directly with the data. With a bit of use, it is trivial to remember that "S" means "senior", whereas 3 does not carry any built-in meaning, particularly if, as you suggest, extra values are added over time. (Add Probationary as, say, 5, and the "low rank = low value" paradigm is out the window.)
This only works if you have a very short list of items. Get too many or too similar, and it's hard to work up usable codes.
Of course, what if these are sequential values? It sure sounds like it here. In that case, don't make them 1, 2, 3, 4; make them 10, 20, 30, 40, so you can insert new categorizations later on. This would also allow you to easily implement ranges, such as "everyone < 30" (meaning less than "senior"), as in the sketch below.
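For example (assuming the gap-numbered values above are stored in experience_required):

-- 10 = intern, 20 = junior, 30 = senior, 40 = director
select * from projects where experience_required < 30  -- everyone below senior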
I guess my main point is: know your data, how it will be used, how it may or will change over time, and plan and code accordingly!
I have a database with a field that holds permit numbers associated with requests. The permit numbers are 13 digits, but a permit may not be issued.
With that said, I currently have the field defined as char(13), allowing NULLs. I have been asked to change it to varchar(13) because char columns, even when NULL, still use the full length.
Is this advisable? Other than space usage, are there any other advantages or disadvantages to this?
I know in an ideal relational system, the permit numbers would be stored in another related table to avoid the use of NULLs, but it is what it is.
Well, if you don't have to use as much space, you can fit more pages in memory, and if you can do that, your system will run faster. This may seem trivial, but I just recently tweaked the data types on a table at a client's, which reduced the amount of reads by 25% and the CPU by about 20%.
As for which is easier to work with, the benefits David Stratton mentioned are noteworthy. I hate having to use trim functions in string building.
If the field should always be exactly 13 characters, then I'd probably leave it as CHAR(13).
Also, an interesting note from BOL:
If SET ANSI_PADDING is OFF when either CREATE TABLE or ALTER TABLE is executed, a char column that is defined as NULL is handled as varchar.
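A quick way to see this behavior with a throwaway test table (note that ANSI_PADDING OFF is deprecated, so this is for demonstration only):

set ansi_padding off
create table PermitTest (PermitNumber char(13) null)
set ansi_padding on

insert into PermitTest values ('123')
select datalength(PermitNumber) from PermitTest  -- 3, not 13: stored like varchar(13)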
Edit: How frequently would you expect the field to be NULL? If it will be populated 95% of the time, it's hardly worth it to make this change.
The biggest advantage I know of (in general, not necessarily in your specific case) is that in code, if you use varchar, you don't have to use a trim function every time you want the value displayed. I run into this a lot when combining FirstName and LastName fields into a FullName. It's just annoying and makes the code less readable.
If you are using SQL Server 2008, you should look at row compression, and perhaps sparse columns if the column is more than ~60% NULL.
I would keep the data type char(13) if all of the populated values use the full length.
Row Compression Information:
http://msdn.microsoft.com/en-us/library/cc280449.aspx
Sparse columns:
http://msdn.microsoft.com/en-us/library/cc280604.aspx
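Rough examples of both options, with a placeholder table name (note that the two can't be combined on the same table, so pick one):

-- row compression stores fixed-length types in variable-length form
-- (Enterprise-only in older versions)
alter table Permits rebuild with (data_compression = row)

-- a sparse column makes NULLs nearly free, at a small storage cost for non-NULL values
alter table Permits alter column PermitNumber add sparse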
I have a situation where I need to store a general piece of data (could be an int, float, or string) in my database, but I don't know ahead of time which it will be. I need a table (or, less preferably, several tables) to store this unknown-typed data.
What I think I am going to do is have a column for each data type, use only one of them for each record, and leave the others NULL. This requires some logic above the database, but that is not too much of a problem because I will be representing these records in models anyway.
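Concretely, the layout I'm describing would look something like this (names are illustrative), with a CHECK constraint as one way to push the "exactly one column populated" rule down into the database:

create table GenericValue (
    Id int identity(1, 1) primary key,
    IntValue int null,
    FloatValue float null,
    StringValue nvarchar(4000) null,
    -- exactly one of the three columns must be populated per row
    constraint CK_GenericValue_OneValue check (
        (case when IntValue is not null then 1 else 0 end) +
        (case when FloatValue is not null then 1 else 0 end) +
        (case when StringValue is not null then 1 else 0 end) = 1
    )
)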
Basically, is there a best practice way to do something like this? I have not come up with anything that is less of a hack than this, but it seems like this is a somewhat common problem. Thanks in advance.
EDIT: Also, is this considered 3NF?
You could easily do that if you used SQLite as a database backend:
Any column in a version 3 database, except an INTEGER PRIMARY KEY column, may be used to store any type of value.
For other RDBMS systems, I would go with Philip's solution.
Note that in my line of software (business applications), I cannot think of any situation where this kind of requirement (a value with an unknown datatype) would be needed, unless the domain model was flawed, of course. I can imagine that other lines of software may involve different practices, but I suggest that you consider rethinking your overall design.
If your application can reliably convert datatypes, you might consider a single-column solution based on a variable-length binary column, with a second column to track the original data type. (I once built a very small routine along these lines, and it worked well enough.) Testing would show whether conversion is handled more efficiently on the application side or the database side.
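A sketch of that single-column variant (table and column names invented for illustration):

create table TypedValue (
    Id int identity(1, 1) primary key,
    ValueType varchar(10) not null,  -- 'int', 'float', or 'string'
    Value varbinary(8000) not null
)

-- store an int by converting it to its binary representation
insert into TypedValue (ValueType, Value)
values ('int', cast(42 as varbinary(8000)))

-- read it back, converting on the database side
select cast(Value as int) from TypedValue where ValueType = 'int'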
If I were to do this, I would choose either your method or casting everything to string and using only one column. Of course, there would be another column holding the type (which would probably be useful for the first method too).
For faster code I would probably go with your method.