I'm concerned about performance, engineering, and readability. Let's say I have a blog, and every post has a status: published (4), pending review (2), draft (1). What is the recommended way to store this information in the status column?
status <======= storing status as string
========
pending
published
draft
status <======= storing status as integer
========
2
4
1
Also, if we should store integers, should we refrain from storing running integers (1, 2, 3, 4, 5) in favor of powers of two (2, 4, 8, 16, 32)?
Many thanks.
I think your best bet for faster performance, less storage space, and readability is to use CHAR(1): (p)ublished, pending (r)eview, and (d)raft. You can validate that data with either a CHECK constraint or a foreign key reference.
CHAR(1) takes one byte, substantially less than a 4-byte integer. It's directly readable by humans, so it doesn't need a join to be understood. Since it's both smaller and immediately readable, you'll get faster retrieval than a join on an integer key, even on a table of tens of millions of rows.
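For illustration, a minimal sketch of the CHECK-constraint route (the posts table and its columns are hypothetical; the syntax is standard SQL and may need tweaking for your database):

CREATE TABLE posts
(
    post_id INT PRIMARY KEY,
    title   VARCHAR(200),
    -- one-character status code: (p)ublished, pending (r)eview, (d)raft
    status  CHAR(1) NOT NULL DEFAULT 'd',
    CONSTRAINT chk_posts_status CHECK (status IN ('p', 'r', 'd'))
);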
Storing as a string:
wastes space
takes longer to read/write
is more difficult to index/search
makes it more difficult to guarantee validity (there's nothing to prevent someone from inserting arbitrary strings)
Ideally, you should use an enum type for this sort of thing, if your database supports it.
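For example, in MySQL (a sketch; the posts table is made up):

CREATE TABLE posts
(
    post_id INT PRIMARY KEY,
    -- stored internally as an integer, but read and written as a string
    status  ENUM('draft', 'pending', 'published') NOT NULL
);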
I think the option you choose should depend on how well the tools/frameworks you use work with each feature.
Many databases/ORMs deal poorly with enums and require custom code, because they don't understand the concept of an "enumerated type".
That said... probably I'd use strings.
Strings:
use more space, but in your case the names are short, and you can easily read a data dump without the enum-table legend. Nowadays, for a blog / CMS, storage is hardly an issue
performance differences are usually small
avoid the enum-table problem that you cannot easily rearrange the members (you have to keep the "original" integer values).
Strings are also the choice of some well-known CMSs (e.g. Drupal 7).
Of course this is a late answer but it could be useful to other readers.
Storing data in integer form is generally more reliable than storing it as characters or strings.
Create two tables, blog_status and blog_details.
In blog_status, maintain the master list of blog statuses: draft, pending, and published, as you said.
Table structure of blog_status
CREATE TABLE blog_status
(
    blogstatus_id   INT,
    blogstatus_desc VARCHAR(10),
    PRIMARY KEY (blogstatus_id)
);
Then create another table that references blog_status. This way, you can improve the reusability and performance of your application.
CREATE TABLE blog_details
(
    blog_id          INT,
    blog_title       VARCHAR(10),
    blog_postingdate DATETIME,
    blog_postbox     VARCHAR(MAX),
    blog_status      INT REFERENCES blog_status (blogstatus_id), -- the blogstatus_id value
    PRIMARY KEY (blog_id)
);
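For example, to read posts together with their human-readable status, you would join the two tables:

SELECT d.blog_id, d.blog_title, s.blogstatus_desc
FROM blog_details d
JOIN blog_status s ON s.blogstatus_id = d.blog_status;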
There is no point in using the x^2 (powers of two) scheme here.
I hope I have cleared up your doubt. If you find the answer helpful, please mark it as the answer; otherwise, let me know...
The database theorist in me thinks that you shouldn't use lookup tables for single-column attributes, because it leads to unnecessary splitting of your data; in other words, you don't need a table with two columns (an ID value and an attribute name). However, the DBA in me thinks that for performance reasons, splitting your data is a very valid technique. Indexing, disk footprints, and updates become very easy when using lookups.
I'd probably split it.
Related
I am well aware that if I use an nvarchar field as a primary key, or as a foreign key, this will add some time and space overhead to the use of the generated index in the majority (if not all) of cases.
As a general rule, using numeric keys is a good idea, but under certain common circumstances (small sets of data, for instance) it isn't a problem to use text-based keys.
However, I am wondering if anyone could provide rigorous information on whether it is MORE efficient, or at least equally efficient, to use text for database keys rather than numeric values under certain circumstances.
Consider a case where a table contains a short list of records. For our example, we'll say we need 50 records. Each record needs an ID. You could use generic int (or even smallint) numbers (e.g. [1...50]), OR you could assign meaningful, 2-character values to a char(2) field (e.g. [AL, AK, AZ, AR, ... WI]).
In the above case, we could assume that using a char(2) field is potentially more efficient than using an int key, since the char data is 2 bytes versus the 4 bytes used by an int. A smallint field would theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.
The benefit of using the text-based key over the numeric key is that the values are readable, which should make it obvious to many that my list of 50 records is likely a list of US states.
As stated, using keys that are smaller than or equal in size to a comparable numeric key should be of similar efficiency. However, depending on the architecture and design of the database engine, it is possible that in-practice usage may yield unexpected results.
With that stated, is it ever more, equally, or less efficient to use any form of text-based value as a key within SQL Server?
I don't need obsessively thorough research results (though I wouldn't mind them), but I am looking for an answer that goes beyond stating what we would expect from a database.
Definitively, how does the efficiency of text-based keys compare to that of numeric-based keys as the size of the text key increases or decreases?
In most cases considerations driven by the business requirements (use cases) will far outweigh any performance differences between numeric v. text keys. Unless you are looking at very large and/or very high throughput systems your choice of primary key type should be based on how the keys will be used rather than any small difference in performance you will see between numeric and text keys.
Think in assembly to find out the answer. You stated this:
we could assume that using a char(2) field is potentially more efficient than using an int key since the char data is 2 bytes, vs. the 4 bytes used with an int. Using a smallint field would theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.
This isn't true, as you can't move 2 characters simultaneously in a single instruction (to my knowledge). So even though a char is smaller than a 4-byte int, you have to move the characters one by one into the register to do a comparison. To compare two instances of a 4-byte int, even though it is larger, you only need one move instruction per int (disregarding that you also need to move them out of the register back into memory).
So what happens if you use an int:
Move one of them into one register
Move the other into another
Do a comparison operation
Move to appropriate memory location depending on the comparison result
In the case of a char, however:
Move one of them into one register
Move the other into another
Do a comparison
If you are lucky, and the order can be determined, then done, and the cost is the same as that in the case of ints.
If they are equal, rinse and repeat using the subsequent characters until the order or equality can be determined. Obviously, this is more costly.
The point is that at a low level, the determining factor is not the data size but the number of instructions needed.
Apart from the low-level stuff:
Yes, there might be cases where it simply doesn't matter because of the small amount of data that is not likely to ever change - the chemical symbols of the elements, for example (though I am not sure whether I'd use them as PKs).
Generally, you don't use artificial PKs for time and space reasons, but because, having nothing to do with real-life data, they are not subject to change. Can you imagine the name of a US state ever changing? I can. If it happens, you would have to update the record itself (if the abbreviation changes too, of course) and all other records that reference it. If you use an int instead, your record will have nothing to do with what happens in reality; you only have to update the abbreviation and the state name itself, and you can sit back assured that everything is consistent.
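A sketch of that point (hypothetical names): with an artificial key, a rename touches exactly one row, because no other table stores the name or abbreviation.

CREATE TABLE states
(
    state_id     INT PRIMARY KEY,       -- artificial key: never changes
    abbreviation CHAR(2) NOT NULL,
    name         VARCHAR(50) NOT NULL
);

-- other tables reference state_id, so a rename is a single update:
UPDATE states SET name = 'New Name', abbreviation = 'NN' WHERE state_id = 31;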
Comparing short strings is not always as trivial as comparing the numeric values of their binary representations. When you also have to consider internationalization, you need to rely on custom (or framework/platform-provided) logic to compare them. To use my language as an example, the letter 'Á' has a decimal value of 193, which is greater than the value of 66 of the letter 'B'; yet, in the Hungarian alphabet, 'Á' precedes 'B'.
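In SQL Server, for instance, that logic is delegated to a collation; a sketch (the cities table is made up):

-- sorts by Hungarian alphabet rules ('Á' before 'B'),
-- not by raw byte values
SELECT name FROM cities ORDER BY name COLLATE Hungarian_CI_AS;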
Using textual data rather than an artificial numeric PK can also cause some fragmentation, and write operations are likely to be slower. The reason is that an artificial, monotonically increasing numeric PK causes newly created rows to be inserted at the end of the table in all cases, thereby avoiding the need to "move stuff around to free up space in between".
I'm not that experienced with databases. If I have a database table containing a lot of empty cells, what's the best way to leave them (so that, for example, performance isn't degraded and memory isn't consumed, if that's even possible)?
I know there's a "null" value. Is there a "none" value or equivalent that has no drawbacks? Or, by just not filling the cell, is it considered empty, so there's nothing left to do? Sorry if it's a silly question. Sometimes you don't know what you don't know...
Not trying to get into a discussion of normalizing the database. Just wondering what the conventional wisdom is for blank/empty/none cells.
Thanks
The convention is to use null to signify a missing value. That's the purpose of null in SQL.
Noted database researcher C. J. Date writes frequently about his objections to the handling of null in SQL at a logical level, and he would say any column that may be missing belongs in a separate table, so that the absence of a row corresponds to a missing value.
I'm not aware of any serious efficiency drawbacks to using null. The efficiency of any feature depends on the specific database implementation you use. You haven't said whether you use MySQL, Oracle, Microsoft SQL Server, or something else.
MySQL's InnoDB storage engine, for example, doesn't store nulls among the columns of a row; it just stores the non-null columns. Other databases may do this differently. Likewise, nulls in indexes should be handled efficiently, but it varies from product to product.
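In practice, using null just means declaring the column without NOT NULL and testing with IS NULL; a minimal sketch (names are made up):

CREATE TABLE requests
(
    request_id  INT PRIMARY KEY,
    approved_on DATE NULL              -- missing until the request is approved
);

-- find the requests still awaiting approval
SELECT request_id FROM requests WHERE approved_on IS NULL;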
Use NULL. That's what it's for.
Normally databases are said to have rows and columns. If the column does not require a value, it holds nothing (aka NULL) until it is updated with a value. That is best practice for most databases, though not all databases have the NULL value--some use an empty string, but they are the exception.
With regard to space utilization -- disk is relatively inexpensive these days, so worries about space consumption are not as prevalent as they once were, except perhaps in gargantuan databases. You can get better performance out of a database if you use all fixed-size datatypes, but once you start allowing variable-sized string types (e.g. varchar, nvarchar), that optimization is no longer possible.
In brief, don't worry about performance for the time being, at least until you get your feet wet.
It is possible, but consider:
Are they supposed to be non-empty? Should you declare them NOT NULL?
Is it a workflow -- so they are empty now, but most of them will be filled in the future?
If both are NO, then you may consider re-design. Edit your question and post the schema you have now.
There are several schools of thought on this. The first is to use null when the data is not known - that's what it's for.
The second is to not allow nulls, and either separate out all the fields that could be null into related tables or create "fake" values to replace null. For varchar this would usually be the empty string, but the problem arises as to what the fake value should be for a date or numeric field. Then you have to write code to exclude the fake data, just like you have to write code to deal with the nulls.
Personally, I prefer to use nulls, with some judicious moving of data to child tables if the data is truly a different entity. Often these fields turn out to need the one-to-many structure of a parent-child relationship anyway: when you may or may not know the phone number of a person, put it in a separate phone table, and you will often discover you needed to store multiple phone numbers anyway (as sketched below).
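A sketch of that parent-child structure (hypothetical names):

CREATE TABLE person
(
    person_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

-- an unknown number simply has no row, instead of a NULL column,
-- and multiple numbers per person come for free
CREATE TABLE phone
(
    phone_id  INT PRIMARY KEY,
    person_id INT NOT NULL REFERENCES person (person_id),
    number    VARCHAR(20) NOT NULL
);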
I am wondering about a basic database design / data type question I am having.
I have a projects table with a field called "experience_required". I know this field will always be populated from one of these options: intern, junior, senior, or director. The list may vary a bit as time evolves, but I don't expect dramatic changes to its items.
Should I go for integer or string? In the future, when I have tons of records like this and need to retrieve them by experience_required, will it make a difference to have them as integers?
You will probably want this field indexed. Once indexed, an integer and a small char string have a negligible performance difference.
Definitely go for Integer over String.
Performance will be better, and your database will be closer to being normalized.
Ultimately, you should create a new table called ExperienceLevel, with fields Id and Title. The experience_required field in the existing table should be changed to a foreign key on the other table.
This will be a much stronger design, and will be more forgiving in the case that you change the experience levels available, or decide to rename an experience level.
You can read more about Normalization here.
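A minimal sketch of that design, using the names from the answer and the question:

CREATE TABLE ExperienceLevel
(
    Id    INT PRIMARY KEY,
    Title VARCHAR(50) NOT NULL
);

CREATE TABLE projects
(
    project_id          INT PRIMARY KEY,
    -- foreign key in place of the old free-form column
    experience_required INT NOT NULL REFERENCES ExperienceLevel (Id)
);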
Integers. Strings should IMHO only be used to store textual data (names, addresses, text, etc).
Besides, integers are in this case better for sorting, storage space, and maintenance.
In theory integers will take less memory when you index them.
You can also use enums (in MySQL), which look like strings but are stored as integers.
Doesn't matter. The difference would be negligible. What difference there is would favor integers, but this is one of the few cases in which I prefer a short text key, since it saves a JOIN back to a lookup table in many reporting situations.
To muddy the waters some, I'll suggest a mix. Start with @GregSansom's idea (upvoted), but instead of integers use the CHAR(1) datatype, with values I, J, S, and D. This will give you the same performance as using tinyint, with the extra advantage of a simple-to-remember mnemonic when (if) working directly with the data. With a bit of use, it is trivial to remember that "S" means "senior", whereas 3 does not carry any built-in meaning--particularly if, as you suggest, extra values are added over time. (Add Probationary as, say, 5, and the "low rank = low value" paradigm is out the window.)
This only works if you have a very short list of items. Get too many or too similar, and it's hard to work up usable codes.
Of course, what if these are sequential values? It sure sounds like it here. In that case, don't make them 1, 2, 3, 4; make them 10, 20, 30, 40, so you can insert new categorizations later on. This also allows you to easily implement ranges, such as "everyone < 30" (meaning less than "senior"), as in the sketch below.
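For example (assuming the 10/20/30/40 coding above and a hypothetical staff table):

-- everyone below senior (senior = 30)
SELECT * FROM staff WHERE experience_level < 30;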
I guess my main point is: know your data, how it will be used, how it may or will change over time, and plan and code accordingly!
I have a database with a field that holds permit numbers associated with requests. The permit numbers are 13 digits, but a permit may not be issued.
With that said, I currently have the field defined as a char(13) that allows NULLs. I have been asked to change it to varchar(13), because chars, if NULL, still use the full length.
Is this advisable? Other than space usage, are there any other advantages or disadvantages to this?
I know in an ideal relational system, the permit numbers would be stored in another related table to avoid the use of NULLs, but it is what it is.
Well, if you use less space, you can fit more pages in memory, and if you can do that, your system will run faster. This may seem trivial, but I recently tweaked the data types on a table at a client, which reduced reads by 25% and CPU by about 20%.
As for which is easier to work with, the benefits David Stratton mentioned are noteworthy. I hate having to use trim functions in string building.
If the field should always be exactly 13 characters, then I'd probably leave it as CHAR(13).
Also, an interesting note from BOL:
If SET ANSI_PADDING is OFF when either CREATE TABLE or ALTER TABLE is executed, a char column that is defined as NULL is handled as varchar.
Edit: How frequently would you expect the field to be NULL? If it will be populated 95% of the time, it's hardly worth it to make this change.
The biggest advantage (in general, not necessarily in your specific case) I know of is that in code, if you use varchar, you don't have to use a trim function every time you want the value displayed. I run into this a lot when combining FirstName and LastName fields into a FullName. It's just annoying and makes the code less readable.
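For example, in T-SQL (the People table is made up), with CHAR you end up writing:

-- CHAR values are padded with trailing spaces, which must be trimmed off
SELECT RTRIM(FirstName) + ' ' + LastName AS FullName FROM People;

-- with VARCHAR, the trim is unnecessary:
SELECT FirstName + ' ' + LastName AS FullName FROM People;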
If you are using SQL Server 2008, you should look at row compression, and perhaps sparse columns if the column is more than ~60% NULLs.
I would keep the datatype char(13) if all of the populated fields use that length.
Row Compression Information:
http://msdn.microsoft.com/en-us/library/cc280449.aspx
Sparse columns:
http://msdn.microsoft.com/en-us/library/cc280604.aspx
Take the following create table statement:
create table fruit
(
    count int,
    name varchar(32),
    size float
)
Instead of those specific data types, why not have "string", "number", and "boolean" -- or, better yet, not have to specify any data types at all?
What are the technical reasons for having such specific data types? (as opposed to generic or no data type)
Imagine 20 million rows in a table, with an int column where all the values are 1 through 10.
If you used a tinyint for that, it would take 1 byte per row. If you used a regular int, it would take 4 bytes: four times the disk space, or 60 MB more across the table.
Theoretically, you could design a database engine to "smart config" a table, but imagine our theoretical table where all of a sudden the database decides it needs to allocate more bytes for the data in the column. The whole table would need to be re-paged, and performance would slow to a crawl, potentially for hours, while the engine restructured the table.
There are so many edge cases and ways to get it wrong, that it would be more work to stay on top of automatic configuration than to just design your application properly in the first place.
It sets a strategy for sorting and indexing, and it enforces data integrity.
Imagine this.
MyNumberField as generic: "1234", 13, 35, "1234afgas"
Why are some of those strings, and why are there letters in "1234afgas"?
With type constraints, those wouldn't be allowed.
Because there is a difference in size and storage:
tinyint = 1 byte
smallint = 2 bytes
int = 4 bytes
bigint = 8 bytes
So if you know you only need to store values up to a certain range, there is no need to use bigint and incur the overhead of storing extra bytes per row.
The same holds for strings (char, varchar, etc.).
There are also built-in constraints: you can't store the letter A in an int, so the data will be clean.
Not only are you telling the database system how you are going to use the data (string, boolean, number); you are also telling it which internal representation to use. This is important for space, indexing, and performance reasons.
To add to what everyone else has posted, there is also a huge issue with data integrity. Imagine you stored the value "1" in the database: should this be treated as TRUE, the numeric value 1, or the string "1"?
If two columns each have a value of "1", does col1 + col2 equal the number 2 or the string "11"?
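The declared type is what resolves that ambiguity. In T-SQL, for example:

SELECT 1 + 1;        -- integers: returns 2
SELECT '1' + '1';    -- strings: returns '11' (+ is concatenation for strings in T-SQL)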
Aside from what's already been said, there are databases that do not require data types, such as SQLite (http://www.sqlite.org/).
There are databases out there that are not typed; the one that comes to mind is IBM's Universe DB (aka Pick). With that DB, all fields are of the string type, and you define how they are used via a "dictionary".
Having used both strongly typed DBs and Universe extensively, I'm partial to the strongly typed ones from a programming standpoint.
The same basic question could be asked of any type anywhere. Why have types in classes? A type is a limitation on, and an expectation of, data. You expect to get type x so you can deal with type x. You don't want to deal with infinite possibilities and do lots of type checking every time you handle a piece of data.
Types, whether primitive or user-defined, are there to define the structure of what is being held. Saying that N is of type X means you can do with it all the things that type X can do.
You are saying, for instance, that you are dealing with an integer that can be in a certain range of numbers, -X to X, versus a big integer, which can be in a larger range, -Z to Z (as a specific example). Usage expectations will fall within those ranges.
You are also, as others have mentioned, defining how to store the information at a lower level. Even saying you have an integer is somewhat of an abstraction away from the machine.
Aside from storage, a specific datatype is also a kind of constraint.
If you know, for instance, that a certain account number will hold exactly 8 characters, defining that in the type is the most logical and performant thing you can do (nchar(8), for example).
That way, you are immediately setting the domain (or part of it; it can be further refined by other constraints) in the field's type.
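A sketch of that idea (the accounts table is hypothetical, and the digits-only pattern is T-SQL-specific):

CREATE TABLE accounts
(
    -- the type fixes the length; the CHECK narrows the domain further
    account_no NCHAR(8) NOT NULL
        CHECK (account_no NOT LIKE '%[^0-9]%'),   -- digits only
    PRIMARY KEY (account_no)
);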
One of the primary functions of a database is to be able to perform operations on huge amounts of data efficiently. Being very specific about data types increases the number of things that the database engine can assume about the data it's storing. Therefore, it has to perform fewer calculations and runs faster, and it can avoid allocating storage that it won't need which makes the database smaller and therefore faster still.
One of the other primary functions of a database is to ensure data integrity. The more exactly you specify what sort of data should be stored in a field, the less likely you are to accidentally store the wrong data there. It's analogous to why your C compiler is so picky about the code you write: you should much prefer to deal with compile-time errors than run-time errors.