Database optimization: What's faster, searching by integers or short strings?

I have a basic database design / data type question.
I have a projects table with a field called "experience_required". I know this field will always be populated from one of these options: intern, junior, senior, or director. This list may vary a bit over time, but I don't expect dramatic changes to the items on it.
Should I go for integer or string? In the future, when I have tons of records like this and need to retrieve them by experience_required, will it make a difference to have them as integers?

You will probably want this field indexed. Once it is indexed, an integer and a short char string have a negligible performance difference.

Definitely go for Integer over String.
Performance will be better, and your database will be closer to being normalized.
Ultimately, you should create a new table called ExperienceLevel, with fields Id and Title. The experience_required field in the existing table should be changed to a foreign key on the other table.
This will be a much stronger design, and will be more forgiving in the case that you change the experience levels available, or decide to rename an experience level.
You can read more about Normalization here.
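The lookup-table design described in this answer could be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than the asker's actual database; the ExperienceLevel table with Id and Title comes from the answer, while the name column, the index name, and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory database; sample data below is illustrative, not from the question.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE ExperienceLevel (
        Id    INTEGER PRIMARY KEY,
        Title TEXT NOT NULL UNIQUE
    );
    CREATE TABLE projects (
        id                  INTEGER PRIMARY KEY,
        name                TEXT NOT NULL,   -- hypothetical column
        experience_required INTEGER NOT NULL REFERENCES ExperienceLevel(Id)
    );
    CREATE INDEX idx_projects_experience ON projects(experience_required);
""")
conn.executemany(
    "INSERT INTO ExperienceLevel (Id, Title) VALUES (?, ?)",
    [(1, "intern"), (2, "junior"), (3, "senior"), (4, "director")],
)
conn.executemany(
    "INSERT INTO projects (name, experience_required) VALUES (?, ?)",
    [("site redesign", 3), ("data entry", 1), ("api rewrite", 3)],
)

# Renaming a level touches one row in the lookup table, not every project.
conn.execute("UPDATE ExperienceLevel SET Title = 'lead' WHERE Id = 3")

senior_projects = conn.execute("""
    SELECT p.name, e.Title
    FROM projects p
    JOIN ExperienceLevel e ON e.Id = p.experience_required
    WHERE p.experience_required = 3
    ORDER BY p.name
""").fetchall()

# A level id that does not exist in the lookup table is rejected.
try:
    conn.execute("INSERT INTO projects (name, experience_required) VALUES ('x', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The rename at the end is the case the answer calls "more forgiving": the projects rows keep their integer keys and pick up the new title through the join.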

Integers. Strings should IMHO only be used to store textual data (names, addresses, text, etc).
Besides, integers are in this case better for sorting, storage space and maintaining.

In theory, integers will take less memory when you index them.
You can also use enums (in MySQL), which look like strings but are stored as integers.

Doesn't matter. The difference would be negligible. What difference there is would favor the choice of integer, but this is one of the few cases in which I prefer a short text key since it will save a JOIN back to a lookup table in many reporting situations.

To muddy the waters some, I'll suggest a mix. Start with @GregSansom's idea (upvoted), but instead of integers use the CHAR(1) datatype, with values I, J, S, and D. This will give you the same performance as using tinyint, and give the extra advantage of a simple-to-remember mnemonic when (if) working directly with the data. With a bit of use, it is trivial to remember that "S" means "senior", whereas 3 does not carry any built-in meaning--particularly if, as you suggest, extra values are added over time. (Add Probationary as, say, 5, and the "low rank = low value" paradigm is out the window.)
This only works if you have a very short list of items. Get too many or too similar, and it's hard to work up usable codes.
Of course, what if these are sequential values? Sure sounds like it here. In that case, don't make them 1,2,3,4, make them 10, 20, 30, 40, so you can insert new categorizations later on. This would also allow you to easily implement ranges, such as "everyone < 30" (meaning less than "senior").
I guess my main point is: know your data, how it will be used, how it may or will change over time, and plan and code accordingly!
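The gapped numbering idea can be sketched like this, again with Python's sqlite3 purely for illustration. The table and column names are hypothetical, and placing "probationary" at 15 (between intern and junior) is an assumption; the answer does not say where it would rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE experience_level (
        sort_value INTEGER PRIMARY KEY,   -- gapped: 10, 20, 30, 40
        code       TEXT NOT NULL UNIQUE,  -- one-letter mnemonic
        title      TEXT NOT NULL
    );
    INSERT INTO experience_level VALUES
        (10, 'I', 'intern'),
        (20, 'J', 'junior'),
        (30, 'S', 'senior'),
        (40, 'D', 'director');
    -- A later addition slots into a gap without renumbering anything.
    -- (Its position at 15 is an assumption for this example.)
    INSERT INTO experience_level VALUES (15, 'P', 'probationary');
""")

# Range query: everyone below "senior".
below_senior = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM experience_level WHERE sort_value < 30 ORDER BY sort_value"
    )
]
```

With sequential values 1 to 4, the same insertion would have required either renumbering or giving up the "low rank = low value" ordering.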

Related

what data type should I use for ids on database?

I have seen many debates and articles about whether auto-incrementing integers or UUIDs should be used for IDs in a database.
They listed some pros and cons of both integers and UUIDs.
For example:
integer: fast, but the available range is limited (unless you use bigint)
uuid: practically unique and much harder to guess, but slower and storage-consuming
Then I wondered whether random strings of around 10 characters ( varchar(10) ), made up of upper- and lower-case letters and digits, would solve these problems, because they are not so big and can still cover a wide range of values (62^10 combinations for 10 characters).
So my question is: is it good or bad to do that?
There is no absolute good or bad when it comes to database design. You should design your database based on your needs.
You mentioned some pros and cons of using int and uuid, so I recommend listing your needs so that you can choose which one to use.
Also keep in mind that you can use some tricks to get around the limits of both ints and uuids.
For example, if uuid seems the right option for you but lookup speed in the database bothers you, you can index the column to speed up uuid lookups. And if you have many writes and need them to be fast, you can use pre-generated uuids (generate a batch of uuids, index them, and pick one each time you need one).
For ints, you can use two ints that together make up the id, or some other arithmetic scheme that makes the id a little harder to guess while staying fast.
These are just two examples of how you can tune your system so that it is fast enough and still meets your needs.
As for using both ints and uuids in the same database design: that is completely fine if it is the best way to meet your needs and get good performance.
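To put rough numbers on the 10-character random ID proposed in the question, here is a small Python sketch. The function names are hypothetical, and the collision estimate uses the standard birthday-bound approximation, so treat the figures as order-of-magnitude guidance rather than guarantees.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 52 letters + 10 digits = 62 symbols
ID_LENGTH = 10


def random_id(length: int = ID_LENGTH) -> str:
    """Return a random ID drawn from a cryptographically strong source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


keyspace = len(ALPHABET) ** ID_LENGTH  # 62**10 possible IDs
bits = math.log2(keyspace)             # just under 60 bits; a version-4 UUID has 122 random bits


def collision_probability(n_ids: int, space: int = keyspace) -> float:
    """Birthday-bound approximation: 1 - exp(-n^2 / (2 * space))."""
    return 1.0 - math.exp(-(n_ids ** 2) / (2 * space))


sample = random_id()
p_million = collision_probability(1_000_000)      # tiny
p_billion = collision_probability(1_000_000_000)  # no longer negligible
```

Whatever the length, the column would still need a unique constraint and a retry on conflict, since random generation alone cannot rule out duplicates.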

SQL Server CHECKSUM function issue

Can anyone explain this to me, and maybe propose a better approach?
Why is checksum(0.0280) = checksum(-0.0280)?
Casting to float would solve it, but I’m reluctant to do that and would rather find a way around it.
Later edit: I was trying to keep things simple. As with most questions around here, this came up in production, and posting the entire database structure would be overkill.
I will try to explain it a bit better. I have some dynamic structure tables (dynamic in the sense that the end user controls the structure through a web application) that have the following rough structure: Id (int), StartDate, FKey1 (nvarchar), Value1 (decimal or nvarchar or int), Value2 ... ValueN.
These tables can be filled (again, by the end user) with redundant data (millions of rows), and during some calculations I would like to declutter a table, leaving only the relevant information. The way to declutter it is to remove consecutive rows that are identical except for the date. For the sake of performance I wanted to avoid checking each column individually, so CHECKSUM came in handy because it also supports multiple columns as input.
If you were thinking that there is one and only one possible value for every possible CHECKSUM, you were mistaken.
From the documentation:
If at least one of the values in the expression list changes, the list
checksum will probably change. However, this is not guaranteed.
Therefore, to detect whether values have changed, we recommend use of
CHECKSUM only if your application can tolerate an occasional missed
change. Otherwise, consider using HashBytes instead. With a specified MD5 hash algorithm, the probability that HashBytes will return the same result, for two different inputs, is much lower compared to CHECKSUM.
If you want to research it further, you might Google CHECKSUM collisions.
With a hashing function (like CHECKSUM) there will always be the risk of collisions.
You can try another (slower) hash function (like HashBytes, as mentioned by @TabAlleman), or you can try some homemade variants that might perform better than HashBytes (this should be tested) and that better fit the numbers you expect to come in. So this is a trade-off: performance versus collision risk. Here are two such homemade variants that give different results for numbers that are equal except for the sign. Please note that these variants will also produce collisions, but most likely for differences other than the sign alone.
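-- Variant 1: pass floor(x) as a second input; it differs between x and -x
select checksum(.028, floor(.028))
select checksum(-.028, floor(-.028))
-- Variant 2: add sign(x) to the checksum
select checksum(.028) + sign(.028)
select checksum(-.028) + sign(-.028)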
When you said you could solve it by casting to a float, but still did not want to do that, I wonder if that was out of performance considerations. If so, I'm not sure my variants will perform better than casting to a float. Have a go at measuring that yourself :-)
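Outside the database, the declutter step (dropping consecutive rows that are identical except for the date) can be sketched with a cryptographic hash over the non-date columns. This is a Python analogue of the HashBytes suggestion, not SQL Server code: it uses hashlib's SHA-256, and the row layout, helper names, and sample values are assumptions for illustration.

```python
import hashlib
from decimal import Decimal


def row_digest(values) -> str:
    """Hash the non-date columns of one row.

    Each value is tagged with its type name so that, for example,
    the int 1 and the string '1' do not serialize identically.
    """
    payload = "\x1f".join(f"{type(v).__name__}:{v!r}" for v in values)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def drop_consecutive_duplicates(rows):
    """rows: (start_date, *values) tuples ordered by date.

    Keeps the first row of every run of rows whose values are identical.
    """
    kept, previous = [], None
    for start_date, *values in rows:
        digest = row_digest(values)
        if digest != previous:
            kept.append((start_date, *values))
        previous = digest
    return kept


rows = [
    ("2020-01-01", "A", Decimal("0.0280")),
    ("2020-01-02", "A", Decimal("0.0280")),   # same values as the previous row: dropped
    ("2020-01-03", "A", Decimal("-0.0280")),  # only the sign differs: kept
    ("2020-01-04", "A", Decimal("0.0280")),   # differs from the previous row: kept
]
decluttered = drop_consecutive_duplicates(rows)
```

A cryptographic hash still has collisions in principle, but they are far less likely than with CHECKSUM; whether the extra cost is acceptable on millions of rows would need measuring.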

Should I store this in the database or in the code?

I'm creating a small game built around weapons. Weapons have characteristics, such as accuracy. When a player crafts a weapon, a value between a min and a max is generated for each characteristic. For example, the accuracy of a new gun is a number between 2 and 5.
My question is: should I store the minimum and maximum values in the database, or should they be hard-coded?
I understand that putting them in the database allows me to change these values easily; however, they won't change very often, and doing this means having to make a database request whenever I need these values. Moreover, it means having many more tables. On the other hand, is it good practice to store this directly in the code?
In conclusion, I really don't know which solution to choose, as both have advantages and disadvantages.
If you have attributes of an entity, then you should store them in the database.
That is what databases are for, storing data. I can see no advantage to hardcoding such values. Worse, the values might be used in different places in your code. And, when you update them, you might end up with inconsistent values throughout the code.
EDIT:
If these are default values, then I can imagine storing them in the code along with all the other information about the weapon -- name of the weapon, category, and so on. Those values are the source information for the weapons.
I still think it would be better to have a Weapons table or WeaponDefaults table so these are in the database. Right now, you might think the defaults are only used in one place. You would be surprised how software can grow. Also, having them in the database makes the values more maintainable.
I would have to agree with @Gordon_Linoff.
I don't think you will end up with "way more tables", maybe one or two. If you had a table with the fields ID, Weapon, Min, Max ...
then you could look up the record when needed. As you said, these values might never change, but changing them in a single spot seems much more admin-friendly than scouring code that you have left alone for a long time. My two cents' worth.
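A table along the lines suggested above, combined with loading the ranges once so that crafting does not hit the database each time, might look like the following sketch. It uses Python's sqlite3 for illustration; the table name, column names, and helper functions are hypothetical, and only the gun accuracy range of 2 to 5 comes from the question.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weapon_stat_range (
        weapon    TEXT NOT NULL,
        stat      TEXT NOT NULL,
        min_value INTEGER NOT NULL,
        max_value INTEGER NOT NULL,
        PRIMARY KEY (weapon, stat),
        CHECK (min_value <= max_value)
    );
    INSERT INTO weapon_stat_range VALUES ('gun', 'accuracy', 2, 5);
""")


def load_ranges(connection):
    """Read every range once (e.g. at startup) into a dictionary."""
    return {
        (weapon, stat): (lo, hi)
        for weapon, stat, lo, hi in connection.execute(
            "SELECT weapon, stat, min_value, max_value FROM weapon_stat_range"
        )
    }


RANGES = load_ranges(conn)


def craft_stat(weapon, stat, rng=random):
    """Roll a value for one characteristic; both bounds are inclusive."""
    lo, hi = RANGES[(weapon, stat)]
    return rng.randint(lo, hi)


accuracy = craft_stat("gun", "accuracy")
rolls = [craft_stat("gun", "accuracy") for _ in range(100)]
```

This keeps a single authoritative copy of the values in the database while answering the question's concern about a query on every use; the cache would need reloading after an edit.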

Difference between storing integer or string in database table

I'm concerned about performance, engineering and readability. Let's say I have a blog, and every post has a status: published (4), pending review (2), draft (1). What is the recommended way to store this information in the status column?
status <======= storing status as string
========
pending
published
draft
status <======= storing status as integer
========
2
4
1
Also, if we should store integers, should we refrain from storing running integers (1, 2, 3, 4, 5), as opposed to storing powers of 2 (2, 4, 8, 16, 32)?
Many thanks.
I think your best bet for faster performance, less storage space, and readability is to use CHAR(1)--(p)ublished, pending (r)eview, and (d)raft. You can validate that data with either a CHECK constraint or a foreign key reference.
CHAR(1) takes substantially less space than an integer. It's directly readable by humans, so it doesn't need a join to understand it. Since it's both smaller and immediately readable, you'll get faster retrieval than a join on an integer even on a table of tens of millions of rows.
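A one-letter status column validated by a CHECK constraint, as this answer proposes, can be sketched as follows. The example runs on SQLite via Python's sqlite3, which accepts the CHAR(1) type name but does not enforce the length, so here the CHECK constraint does all the validation; the post table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE post (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        -- (p)ublished, pending (r)eview, (d)raft
        status CHAR(1) NOT NULL CHECK (status IN ('p', 'r', 'd'))
    )
""")
conn.executemany(
    "INSERT INTO post (title, status) VALUES (?, ?)",
    [("Hello", "p"), ("WIP", "d"), ("Review me", "r")],
)

# The code is readable in a raw dump and needs no join to filter on.
published = conn.execute(
    "SELECT title FROM post WHERE status = 'p'"
).fetchall()

# Anything outside the allowed codes is rejected by the constraint.
try:
    conn.execute("INSERT INTO post (title, status) VALUES ('Bad', 'x')")
    invalid_rejected = False
except sqlite3.IntegrityError:
    invalid_rejected = True
```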
Storing as a string:
wastes space
takes longer to read/write
is more difficult to index/search
makes it more difficult to guarantee validity (there's nothing to prevent someone inserting arbitrary strings)
Ideally, you should use an enum type for this sort of thing, if your database supports it.
I think the option you choose should depend on how well the tools/frameworks you use work with each feature.
Many databases/ORMs deal poorly with enums and require custom code, because they don't understand the concept of an "enumerated type".
That said... I'd probably use strings.
Strings:
use more space, but in your case the names are short and you can easily read a data dump without the enum-table legend. Nowadays, for a blog / CMS, storage is hardly an issue
performance differences are usually small
with enum tables, by contrast, you cannot easily rearrange the members (you have to force the "original" integer values).
Strings are also the choice of some well known CMSs (e.g. Drupal 7).
Of course this is a late answer but it could be useful to other readers.
Storing data in integer form is more reliable than storing it as characters or strings.
Create two tables, such as blog_status and blog_details.
In blog_status, maintain the master list of blog statuses, as you said: draft, pending and publish.
Table structure of blog_status:
Create table blog_status
(
blogstatus_id int,
blogstatus_desc varchar(10),
primary key(blogstatus_id)
)
Then create another table that uses blog_status. This way, you can improve the reusability and performance of your application:
Create table blog_details
(
blog_id int,
blog_title varchar(10),
blog_postingdate datetime,
blog_postbox varchar(max),
blog_status int, -- holds a blogstatus_id value
primary key(blog_id),
foreign key(blog_status) references blog_status(blogstatus_id)
)
There is no point in using powers of 2 (2, 4, 8, ...) for the status values.
I hope I have cleared up your doubt. If you find the answer helpful, please mark it as your answer; otherwise, let me know.
The database theorist in me thinks that you shouldn't use lookup tables for single column attributes because it leads to unnecessary splitting of your data; in other words, you don't need to have a table with two columns (an ID value and an attribute name). However, the DBA in me thinks that for performance reasons, splitting your data is a very valid technique. Indexing, disk footprints, and updates become very easy when using lookups.
I'd probably split it.
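On the question of running integers versus powers of 2: powers of 2 are the usual choice when several values must be combined into one integer as bit flags, which is not the case for a status where a post has exactly one value at a time. A short Python sketch of the difference follows; the Flags members are hypothetical and not part of the original question.

```python
from enum import IntEnum, IntFlag


class Status(IntEnum):
    """Mutually exclusive values: running integers are enough."""
    DRAFT = 1
    PENDING = 2
    PUBLISHED = 3


class Flags(IntFlag):
    """Combinable options (hypothetical): each member needs its own bit."""
    FEATURED = 1
    PINNED = 2
    COMMENTS_CLOSED = 4


# Several flags packed into a single integer column value.
stored = int(Flags.FEATURED | Flags.COMMENTS_CLOSED)

# Testing individual bits after reading the value back.
is_featured = bool(Flags(stored) & Flags.FEATURED)
is_pinned = bool(Flags(stored) & Flags.PINNED)
```

Since draft, pending and published exclude one another, the Status form is the relevant one here, which matches the answers above.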

Naming conventions for non-normalized fields

Is it a common practice to use special naming conventions when you're denormalizing for performance?
For example, let's say you have a customer table with a date_of_birth column. You might then add an age_range column because sometimes it's too expensive to calculate that customer's age range on the fly. However, one could see this getting messy because it's not abundantly clear which values are authoritative and which ones are derived. So maybe you'd want to name that column denormalized_age_range or something.
Is it common to use a special naming convention for these columns? If so, are there established naming conventions for such a thing?
Edit: Here's another, more realistic example of when denormalization would give you a performance gain. This is from a real-life case. Let's say you're writing an app that keeps track of college courses at all the colleges in the US. You need to be able to show, for each degree, how many credits you graduate with if you choose that degree. A degree's credit count is actually ridiculously complicated to calculate and it takes a long time (more than one second per degree). If you have a report comparing 100 different degrees, it wouldn't be practical to calculate the credit count on the fly. What I did when I came across this problem was I added a credit_count column to our degree table and calculated each degree's credit count up front. This solved the performance problem.
I've seen column names use the word "derived" when they represent that kind of value. I haven't seen a generic style guide for other kinds of denormalization.
I should add that in every case I've seen, the derived value is always considered secondary to the data from which it is derived.
In some programming languages, e.g. Java, names with a _ prefix are used for private methods or variables. Private means they should not be modified or invoked by any code outside the class.
I wonder if this convention can be borrowed for naming derived database columns.
In Postgres, column names can start with _, e.g. _average_product_price.
It can convey the meaning that you can read this column, but don't write it because it's derived.
I'm in the same situation right now, designing a database schema that can benefit from denormalisation of central values. For example, table partitioning requires the partition key to exist in the table. So even if the data can be retrieved by following some levels of foreign keys, I need the data right there in most tables.
Maybe the suffix "copy" could be used for this. After all, the data is just a copy of some other location where the primary data is stored. Since it's a word, it can work with all naming conventions, like .NET PascalCase which can be mapped to SQL snake_case, e.g. CompanyIdCopy and company_id_copy. And it's a short word so you don't have to write too much. And it's not an abbreviation so you don't have to spell it or ever wonder what it means. ;-)
I could also think of the suffix "cache" or "cached" but a cache is usually filled on demand and invalidated some time later, which is usually not the case with denormalised columns. That data should exist at all times and never be outdated or missing.
The word "derived" is just a bit longer than "copy". I know that one special DBMS, an expensive one, has a column name limit of 30 characters, so that could be an issue.
If all of the values required to derive the calculation are in the table already, then it is extremely unlikely that you will gain any meaningful (or even measurable) performance benefit by persisting these calculated values.
I realize this doesn't answer the question directly, but it would seem that the premise is faulty: if such conditions existed for the question to apply, then you don't need to denormalize it to begin with.
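The precomputed credit count from the question, combined with the "derived" naming idea from the first answer, could be sketched as follows. This uses Python's sqlite3 for illustration; the degree_course table, the derived_credit_count column name, and the simple SUM standing in for the slow calculation are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE degree (
        id                   INTEGER PRIMARY KEY,
        name                 TEXT NOT NULL,
        derived_credit_count INTEGER   -- derived value; NULL until first computed
    );
    CREATE TABLE degree_course (        -- hypothetical authoritative source data
        degree_id INTEGER NOT NULL REFERENCES degree(id),
        credits   INTEGER NOT NULL
    );
    INSERT INTO degree (id, name) VALUES (1, 'Physics'), (2, 'History');
    INSERT INTO degree_course VALUES (1, 4), (1, 3), (2, 3);
""")


def expensive_credit_count(connection, degree_id):
    """Stand-in for the slow calculation described in the question."""
    (total,) = connection.execute(
        "SELECT COALESCE(SUM(credits), 0) FROM degree_course WHERE degree_id = ?",
        (degree_id,),
    ).fetchone()
    return total


def refresh_derived_credit_counts(connection):
    """Recompute the derived column from the authoritative data."""
    ids = [row[0] for row in connection.execute("SELECT id FROM degree")]
    for degree_id in ids:
        connection.execute(
            "UPDATE degree SET derived_credit_count = ? WHERE id = ?",
            (expensive_credit_count(connection, degree_id), degree_id),
        )
    connection.commit()


refresh_derived_credit_counts(conn)

# The report reads the stored value instead of recalculating per degree.
report = conn.execute(
    "SELECT name, derived_credit_count FROM degree ORDER BY name"
).fetchall()
```

The prefix makes it visible in every query that the column is secondary to the data it was computed from, and that the refresh function is the only code meant to write it.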

Resources