SQL Server CHECKSUM function issue

Can anyone explain this to me, and maybe propose a better approach?
Why is checksum(0.0280) = checksum(-0.0280)?
Casting to float would solve it, but I'm reluctant to do that and would rather find a way around it.
LE: I was trying to keep things simple. As with most questions around here, this is something that came up in production, and posting the entire database structure would be overkill.
I will try to explain it a bit better. I have some dynamic-structure tables (dynamic in the sense that the end user controls the structure through a web application) with the following rough shape: Id (int), StartDate, FKey1 (nvarchar), Value1 (decimal or nvarchar or int), Value2 ... ValueN.
These tables can be filled (again, by the end user) with redundant data (millions of rows), and during some calculations I would like to declutter them, leaving only relevant information. The way to declutter is to remove consecutive identical rows (identical except for the date). For the sake of performance I wanted to avoid comparing each column individually, so CHECKSUM came in handy because it accepts multiple columns as input.

If you were thinking that there is one and only one set of input values for every possible CHECKSUM, you were mistaken: different inputs can produce the same checksum.
From the documentation:
If at least one of the values in the expression list changes, the list
checksum will probably change. However, this is not guaranteed.
Therefore, to detect whether values have changed, we recommend use of
CHECKSUM only if your application can tolerate an occasional missed
change. Otherwise, consider using HashBytes instead. With a specified MD5 hash algorithm, the probability that HashBytes will return the same result, for two different inputs, is much lower compared to CHECKSUM.
If you want to research it further, you might Google CHECKSUM collisions.
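For contrast, a quick check showing HASHBYTES telling the two values from the question apart (a minimal sketch; HASHBYTES needs character or binary input, so the decimals are cast to varchar first):
select hashbytes('MD5', cast(0.0280 as varchar(20)))   -- one hash
select hashbytes('MD5', cast(-0.0280 as varchar(20)))  -- a different hash: the sign survives in the string form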

With a hashing function (like CHECKSUM) there will always be a risk of collisions.
You can try another (slower) hash function (like HASHBYTES, as mentioned by @TabAlleman), or you can try some homemade variants that might perform better than HASHBYTES (this should be tested) and that fit your expectations of the incoming numbers better. It is a trade-off: performance versus collision risk. Here are two such homemade variants that give different results for numbers that differ only in sign. Note that these variants will also produce collisions, but most likely for differences other than the sign alone.
select checksum(.028, floor(.028))
select checksum(-.028, floor(-.028))
select checksum(.028) + sign(.028)
select checksum(-.028) + sign(-.028)
When you said you could solve it by casting to float but still didn't want to, I wonder whether that was out of performance considerations. If so, I'm not sure my variants will perform better than casting to float. Have a go at measuring that yourself :-)
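To tie this back to the decluttering scenario in the question, here is a minimal sketch of the HASHBYTES route, assuming a hypothetical dbo.Readings table shaped like the question's example (all names are illustrative; CONCAT_WS requires SQL Server 2017 or later, on older versions concatenate with explicit separators):
with hashed as (
    select Id, StartDate, FKey1,
           hashbytes('SHA2_256',
                     concat_ws(N'|', FKey1, Value1, Value2)) as RowHash
    from dbo.Readings
),
flagged as (
    select Id,
           case when RowHash = lag(RowHash) over (partition by FKey1
                                                  order by StartDate)
                then 1 else 0 end as IsDuplicate
    from hashed
)
select Id from flagged where IsDuplicate = 1  -- candidates for removal
The separator matters: without it, ('ab', 'c') and ('a', 'bc') would hash identically.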


Should I store this in the database or in the code?

I'm creating a small game built around weapons. Weapons have characteristics, like accuracy. When a player crafts a weapon, a value between a minimum and a maximum is generated for each characteristic. For example, the accuracy of a new gun is a number between 2 and 5.
My question is: should I store the minimum and maximum values in the database, or should they be hard-coded in the code?
I understand that putting them in the database lets me change these values easily, but they won't change very often, and it means making a database request whenever I need them. Moreover, it means having many more tables. On the other hand, is it good practice to store this directly in the code?
In conclusion, I really don't know which solution to choose, as both have advantages and disadvantages.
If you have attributes of an entity, then you should store them in the database.
That is what databases are for, storing data. I can see no advantage to hardcoding such values. Worse, the values might be used in different places in your code. And, when you update them, you might end up with inconsistent values throughout the code.
EDIT:
If these are default values, then I can imagine storing them in the code along with all the other information about the weapon -- name of the weapon, category, and so on. Those values are the source information for the weapons.
I still think it would be better to have a Weapons table or WeaponDefaults table so these are in the database. Right now, you might think the defaults are only used in one place. You would be surprised how software can grow. Also, having them in the database makes the values more maintainable.
I would have to agree with @Gordon_Linoff.
I don't think you will end up with "way more tables", maybe one or two. If you had a table with fields ID, Weapon, Min, Max (see the sketch below),
then you could do a recordset lookup when needed. As you said, these values might never change, but changing them in a single spot seems much more admin-friendly than scouring code you have left alone for a long time. My two cents' worth.
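For what it's worth, a minimal sketch of such a table (all names and values are illustrative):
create table dbo.WeaponDefaults (
    Id          int identity(1,1) primary key,
    Weapon      nvarchar(50) not null unique,
    MinAccuracy tinyint not null,
    MaxAccuracy tinyint not null,
    constraint CK_WeaponDefaults_Range check (MinAccuracy <= MaxAccuracy)
)

insert into dbo.WeaponDefaults (Weapon, MinAccuracy, MaxAccuracy)
values (N'Gun', 2, 5)  -- the accuracy range from the question
One narrow table like this holds the ranges for every weapon, which is hardly "way more tables".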

Generating a custom uniqueidentifier in SQL Server

I'm using SQL Server 2012. Is it possible to generate a uniqueidentifier value based on two or three values, mostly varchars or decimals; I mean any data type that takes 0-9 and a-z?
Usually a uniqueidentifier varies from system to system. For my requirement, I need a custom one: whenever I call this function, it should give me the same value on all systems.
I have been thinking of converting the values into varbinary, taking certain parts of it, and generating a uniqueidentifier from those. How good is this approach?
I'm still working on it.
Please provide your suggestions.
What you describe is a hash of the values. Use the HASHBYTES function to digest your values into a hash. But your definition of the problem contradicts the requirement for uniqueness since, by definition, reducing an input of size M to a hash of size N, where N < M, may generate collisions. If you truly need uniqueness, then redefine the requirements in a manner that at least allows for it. Namely, the requirement that it should get me the same value in all the systems must be dropped, since the only way to guarantee uniqueness is to output exactly the input. If you remove this requirement, then the new requirements are satisfied by NEWID() (yes, it does not consider the input, but it doesn't have to in order to meet your requirements).
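For illustration, a minimal sketch of that HASHBYTES approach, deriving a deterministic uniqueidentifier from a couple of input values (variable names are illustrative; MD5 yields exactly the 16 bytes a uniqueidentifier holds, and the collision caveat above still applies):
declare @a nvarchar(50)   = N'ABC123'
declare @b decimal(10, 4) = 42.5

select cast(hashbytes('MD5', concat(@a, N'|', @b)) as uniqueidentifier) as DerivedId
The same inputs produce the same DerivedId on any system, which is what the question asks for, at the cost of guaranteed uniqueness.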
The standards document for uniqueidentifiers goes to some length showing how they are generated: http://www.ietf.org/rfc/rfc4122.txt
I would give this a read (especially section 4.1.2, as it breaks down how a GUID should be generated). You could keep the timestamp components but hard-code the network-location element, which would give you what you are looking for.

Database optimization: What's faster searching by integers OR short strings?

I am wondering about a basic database design / data type question I am having.
I have a projects table with a field called "experience_required". I know this field will always be populated with one of these options: intern, junior, senior, or director. The list may vary a bit as time evolves, but I don't expect dramatic changes to it.
Should I go for integer or string? In the future, when I have tons of records like this and need to retrieve them by experience_required, will it make a difference to have them as integers?
You may want this field indexed. Once indexed, an integer and a small char string don't have much (read: negligible) performance difference.
Definitely go for Integer over String.
Performance will be better, and your database will be closer to being normalized.
Ultimately, you should create a new table called ExperienceLevel, with fields Id and Title. The experience_required field in the existing table should be changed to a foreign key on the other table.
This will be a much stronger design, and will be more forgiving in the case that you change the experience levels available, or decide to rename an experience level.
You can read more about Normalization here.
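For illustration, a minimal sketch of that design (all names are illustrative):
create table dbo.ExperienceLevel (
    Id    int identity(1,1) primary key,
    Title nvarchar(30) not null unique
)

create table dbo.Projects (
    Id                int identity(1,1) primary key,
    ExperienceLevelId int not null
        references dbo.ExperienceLevel (Id)
)
Renaming an experience level then touches a single row in ExperienceLevel, and the foreign key keeps stray values out of Projects.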
Integers. IMHO, strings should only be used to store textual data (names, addresses, text, etc.).
Besides, integers are better here for sorting, storage space, and maintainability.
In theory, integers take up less memory when you index them.
You can also use enums (in MySQL), which look like strings but are stored as integers.
Doesn't matter. The difference would be negligible. What difference there is would favor the choice of integer, but this is one of the few cases in which I prefer a short text key since it will save a JOIN back to a lookup table in many reporting situations.
To muddy the waters some, I'll suggest a mix. Start with @GregSansom's idea (upvoted), but instead of integers use the CHAR(1) datatype, with values I, J, S, and D. This gives you the same performance as tinyint, plus the advantage of a simple mnemonic when (if) working directly with the data. With a bit of use, it is trivial to remember that 'S' means 'senior', whereas 3 carries no built-in meaning, particularly if, as you suggest, extra values are added over time. (Add Probationary as, say, 5, and the "low rank = low value" paradigm is out the window.)
This only works if you have a very short list of items. Get too many or too similar, and it's hard to come up with usable codes.
Of course, what if these are sequential values? It sure sounds like it here. In that case, don't make them 1, 2, 3, 4; make them 10, 20, 30, 40, so you can insert new categorizations later. This also lets you easily implement ranges, such as "everyone < 30" (meaning less than senior).
I guess my main point is: know your data, how it will be used, and how it may or will change over time, and plan and code accordingly!
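Here is a minimal sketch combining the CHAR(1) mnemonics with the spaced ranks described above (codes, names, and values are all illustrative):
create table dbo.ExperienceCode (
    Code     char(1) primary key,  -- 'I', 'J', 'S', 'D'
    SortRank int not null unique   -- 10, 20, 30, 40: gaps leave room for new levels
)

insert into dbo.ExperienceCode (Code, SortRank)
values ('I', 10), ('J', 20), ('S', 30), ('D', 40)

select Code from dbo.ExperienceCode where SortRank < 30  -- "everyone below senior"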

How to Handle Unknown Data Type in one Table

I have a situation where I need to store a general piece of data (could be an int, float, or string) in my database, but I don't know ahead of time which it will be. I need a table (or, less preferably, tables) to store this unknown-typed data.
What I think I am going to do is have a column for each data type, populate only one of them per record, and leave the others NULL. This requires some logic above the database, but that is not too much of a problem because I will be representing these records in models anyway.
Basically, is there a best-practice way to do something like this? I have not come up with anything less of a hack than this, but it seems like a somewhat common problem. Thanks in advance.
EDIT: Also, is this considered 3NF?
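For reference, a minimal sketch of the one-column-per-type layout described in the question, with a CHECK constraint enforcing that exactly one value column is populated (all names are illustrative):
create table dbo.GenericValue (
    Id          int identity(1,1) primary key,
    ValueType   char(1) not null,  -- 'I' = int, 'F' = float, 'S' = string
    IntValue    int null,
    FloatValue  float null,
    StringValue nvarchar(400) null,
    constraint CK_GenericValue_ExactlyOne check (
        case when IntValue    is not null then 1 else 0 end +
        case when FloatValue  is not null then 1 else 0 end +
        case when StringValue is not null then 1 else 0 end = 1
    )
)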
You could easily do that if you used SQLite as a database backend:
Any column in a version 3 database, except an INTEGER PRIMARY KEY column, may be used to store any type of value.
For other RDBMS systems, I would go with Philip's solution.
Note that in my line of software (business applications), I cannot think of any situation where this kind of requirement (a value with an unknown datatype) would come up. Unless the domain model was flawed, of course... Other lines of software may call for different practices, but I suggest that you consider rethinking your overall design.
If your application can reliably convert datatypes, you might consider a single-column solution based on a variable-length binary column, with a second column to track the original data type, as sketched below. (I did a very small routine based on this once, and it worked well enough.) Testing would show whether conversion is handled more efficiently on the application side or the database side.
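A minimal sketch of that binary round trip (the value, names, and the varchar type tag are illustrative):
declare @stored     varbinary(max) = cast(cast(3.14 as float) as varbinary(8))
declare @storedType varchar(10)    = 'float'  -- the companion type-tracking column

select cast(@stored as float) as RoundTripped,  -- back to 3.14
       @storedType            as OriginalType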
If I were to do this, I would choose either your method, or I would cast everything to string and use only one column. Either way, there would be another column holding the type (which would probably be useful for the first method too).
For faster code, I would probably go with your method.

Generating surrogate keys remotely

Sorry in advance as this question is similar (but not the same!) to others.
Anyway, I need to be able to generate surrogate keys in more than one location, to be synchronized at a later time. I was considering using GUIDs, but these keys may have to appear in URL parameters, and GUIDs would make them complicated and ugly.
I was considering a scheme that would allow me to use integers, giving better performance in the database, but obviously I cannot simply use auto-numbers. The idea is to use a key with two components, the High-Low strategy as I believe it is called. The key would consist of the source (where it was generated, generally one of two locations in this business case) and the auto-incremented value. For instance:
1-000000567,
1-000000568,
1-000000569,
1-000000570,
...
And for another source:
2-000000567,
2-000000568,
...
This would also mean that I could store them in the database as integers (i.e. "2-000000567" would become the integer 2000000567).
Can anyone see any issues with this, such as indexing or fragmentation problems? Or perhaps even a better way of doing it?
Just to confirm: there is no business meaning in this key; the user will never see it (except perhaps in the parameters of a URL), nor use it.
I look forward to your opinions and appreciate your time. Thanks a million :)
This explains the hilo algorithm you refer to: What's the Hi/Lo algorithm?
It's an often-used solution to "disconnected" scenarios such as yours. For example, if you're using Hibernate/NHibernate, it's one of the recommended primary-key options.
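For completeness, a minimal sketch of the source-prefixed integer encoding from the question (values are illustrative; bigint avoids overflow as the source id or sequence grows):
declare @SourceId bigint = 2
declare @LocalSeq bigint = 567

select @SourceId * 1000000000 + @LocalSeq as SurrogateKey  -- 2000000567
The nine-digit multiplier matches the zero-padded width in the question, so the two parts can always be split back apart with division and modulo.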
