One table with redundant data in one column vs Multiple tables - sql-server

I have a scenario wherein users upload their custom preferences for the conversion being done by our utility. The problem we are facing is:
should we keep these custom mappings in one table with an extra column holding a UID or maybe a GUID,
or
should we create tables on the fly for every user that registers with us, giving each user a separate table for their custom mappings altogether?
We're more inclined towards having just one table for the custom mappings with an extra column, but does the redundant data in that one column have any effect on performance? On the other hand, logic dictates that we normalize our tables and have separate tables. I'm rather confused as to what to do.

Common practice is to use a single table with a user id. If there are commonly reused sets of preferences, you can refactor these out into a separate table and reference them in one go, but that is up to you.
Why is creating one table per user a bad idea? Well, you don't want one million tables in your database if you end up with one million users. Doing it this way also guarantees that your queries will be unwieldy at best, often slower than they ought to be, and frequently dynamically generated and EXECed, which is usually a last-resort option and not necessarily safe.
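As a rough sketch of that single-table approach (all table and column names here are invented for illustration, not taken from the question):

CREATE TABLE dbo.Users (
    UserId   bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    UserName nvarchar(100)        NOT NULL
);

CREATE TABLE dbo.CustomMappings (
    MappingId   bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    UserId      bigint        NOT NULL REFERENCES dbo.Users (UserId),
    SourceField nvarchar(100) NOT NULL,
    TargetField nvarchar(100) NOT NULL
);

-- Every query filters on UserId, so index it.
CREATE NONCLUSTERED INDEX IX_CustomMappings_UserId
    ON dbo.CustomMappings (UserId);

One user's mappings are then just a plain WHERE UserId = @UserId filter; no dynamic SQL anywhere.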

I inherited a system where the original developers had chosen to go with the one-table-per-client approach. We are now rewriting the system! From a dev perspective, you end up with a tangled mess of dynamic SQL and looping. The DBAs will not be happy either, because they wake up to an entirely different database every morning, which makes performance tuning and maintenance nearly impossible.
You could have an additional BLOB or XML column in the users table, storing the preferences. This is called the property bag approach, but I would not recommend this either, as it will hamper query performance in many cases.
The best approach in this scenario is to solve the problem with normalisation: a separate table for the properties, joined to the users table with a PK/FK relationship. I would not recommend using a GUID. The GUID would likely end up as the PK, meaning you would have a 16-byte clustered index key, which would also be carried through to all non-clustered indexes on the table. You would probably also be building a non-clustered index on this column in the Users table. You also run the risk of falling into the pitfall of generating these GUIDs with NEWID() rather than NEWSEQUENTIALID(), which will lead to massive fragmentation.
If you take the normalisation approach and are concerned about returning the results across the two tables in a tabular format, you can simply use the PIVOT operator.
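As a sketch of what that might look like (the property names are invented for illustration):

CREATE TABLE dbo.UserProperties (
    UserId        bigint        NOT NULL REFERENCES dbo.Users (UserId),
    PropertyName  nvarchar(100) NOT NULL,
    PropertyValue nvarchar(400) NULL,
    CONSTRAINT PK_UserProperties PRIMARY KEY (UserId, PropertyName)
);

-- One row per user, with properties pivoted out into columns.
SELECT UserId, [Theme], [Language]
FROM (
    SELECT UserId, PropertyName, PropertyValue
    FROM dbo.UserProperties
) AS src
PIVOT (
    MAX(PropertyValue) FOR PropertyName IN ([Theme], [Language])
) AS p;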


What number of columns makes a table really big?

I have two tables in my database, one for logins and a second for user details (the database contains more than just these two tables). The Logins table has 12 columns (Id, Email, Password, PhoneNumber ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo ...). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the table big.
So this leads to my question: what number of columns makes a table big? Is there a certain or approximate number of columns at which we should stop adding columns to a table and create another one? Or is it up to the programmer to decide?
The number of columns isn't realistically a problem. Any kind of performance issue you seem to be worried about can be attributed to the size of the DATA in the table: for instance, if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each row.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
Point here is, we can't really give you any advice since just the number of tables and columns and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is exceeding the maximum number of columns the database engine can support for a single table. For SQL Server that is 1024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting up by columns) might be advisable. One of the first reasons a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and their size and scope.
I take it that the login table contains data that is only used when the user logs into your system, and that for all other purposes the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index that no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index rather than the entire row. So creating a "login" index and a "details" index would achieve much the same benefit as separate tables, with the added benefit that any operation that does need all the data would not have to perform a join of two tables.
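A hedged sketch of that idea against a hypothetical combined Users table (column names invented): a covering index restricted to the login fields means login queries never touch the rest of the row.

-- The key plus the INCLUDEd columns satisfy the login query
-- entirely from the index pages.
CREATE NONCLUSTERED INDEX IX_Users_Login
    ON dbo.Users (Email)
    INCLUDE (PasswordHash, PhoneNumber);

-- Answerable from IX_Users_Login alone, no lookup into the full row.
SELECT PasswordHash, PhoneNumber
FROM dbo.Users
WHERE Email = N'someone@example.com';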

Is it bad to use redundant relationships?

Suppose I have a hierarchy of tables in my database, with a Company table at the root (e.g. Customer and Section reference Company, and Contract references Customer).
Now all my queries depend on the Company table. Is it bad practice to give every other table a (redundant) relationship to the Company table to simplify my SQL queries?
Edit 1: Background is a usage problem with a framework. See Django: limiting model data.
Edit 2: No tuple would ever change its company.
Edit 3: I don't write the MySQL queries myself; I use an abstraction layer (Django).
It is bad practice because your redundant data has to be updated independently and therefore redundantly, a process that is fraught with potential for error. (Even automatic cascading has to be assigned and maintained separately.)
By introducing this relation you effectively denormalize your database. Denormalization is sometimes necessary for the sake of performance but from your question it sounds like you're just simplifying your SQL.
Use other mechanisms to abstract the complexity of your database: Views, Stored Procs, UDFs
What you are asking is whether to violate Third Normal Form in your design. Doing so is not something to be done without good reason because by creating redundancy you create the possibility for errors and inconsistencies in your data. Also, "simplifying" the model with redundant data to support some operations is likely to complicate other operations. Also, constraints and other data access logic will likely need to be duplicated unnecessarily.
Is it a bad practice to give every other table a (redundant) relation to the Company table to simplify my sql queries?
Yes, absolutely, as it would mean updating every redundant relation when you update the relations customer to company or section to company -- and if you miss any such update, you now have a database full of redundant data. It's a bad denormalization.
If your point is to just simplify your SQL, consider using views to "bring along" parent data. Here's a view that pulls company_id into contract, by join through customer:
create view contract_customer as
select
    a.*,          -- every contract column, including contract_id
    b.company_id  -- the parent key pulled in from customer
from
    contract a
    join customer b on (a.customer_id = b.customer_id);
This join is simple, but why repeat it over and over? Write it once, and then use the view in other queries.
Many (but not all) RDBMSes can even optimize out the join if you don't put any columns from customer in the select list or where clause of the query based on the view, as long as you make contract.customer_id have a foreign key referential integrity constraint on customer.customer_id. (In the absence of such a constraint, the join can't be omitted, because it would then be possible for a contract.customer_id to exist which did not exist in customer. Since you'll never want that, you'll add the foreign key constraint.)
Using the view achieves what you want, without the time overhead of having to update the child tables, without the space overhead of making child rows wider by adding the redundant column (and this really begins to matter when you have many rows, as the wider the row, the fewer rows can fit into memory at once), and most importantly, without the possibility of inconsistent data when the parent is updated but the children are not.
If you really need to simplify things, this is where a View (or multiple views) would come in handy.
Having a column for the company in your employee view would not be poorly normalized providing it is derived from a join on section.
If you mean add a Company column to every table, it's a bad idea. It'll increase the potential for data integrity issues (i.e. it gets changed in one table but not the other 6 where it should).
I'd say not in the OP's case, but sometimes it's useful (just like goto ;).
An anecdote:
I'm working with a database where most tables have a foreign key pointing to a root table for the accounts. The account numbers are external to the database and aren't allowed to be changed once issued. So there is no danger of changing the account numbers and failing to update all references in the DB. I also find it considerably easier to grab data from tables keyed by account number instead of having to do complex and costly joins up the hierarchy to get to the root account table. But in my case we don't have so much a foreign key as an external (i.e., real-world) identifier, so it's not quite the same as the OP's situation and seems a suitable exception.
That depends on your functional requirements for performance. Is your application going to handle heavy demand? Simplifying joins boosts performance, and besides, hardware is cheap and turn-around time is important.
The deeper you go into the database normal forms, the more space you save, but the heavier the computation.

How efficient is a details table?

At my job, we have a pseudo-standard of creating one table to hold the "standard" information for an entity, and a second table, named like 'TableNameDetails', which holds optional data elements. On average, every row in the main table will have about 8-10 detail rows.
My question is: What kind of performance impacts does this have over adding these details as additional nullable columns on the main table?
8-10 detail rows or 8-10 detail columns?
If it's rows, then you're mixing apples and oranges, as a one-to-many relationship cannot be flattened out into columns.
If it's columns, then you're talking about vertical partitioning. For large and very large tables, moving seldom-referenced columns into Extra or Details tables (i.e. partitioning the columns vertically into 'hot' and 'cold' tables) can have significant and even huge performance benefits. A narrower table means a higher density of data per page, which in turn means fewer pages needed for frequent queries, less IO, and better cache efficiency: all goodness.
Mileage may vary, depending on the average width of the 'details' columns and how 'seldom' the columns are accessed.
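A minimal sketch of such a vertical partition, with invented names; both tables share the same primary key:

-- 'Hot' table: narrow, frequently read.
CREATE TABLE dbo.Product (
    ProductId bigint         NOT NULL PRIMARY KEY,
    Name      nvarchar(100)  NOT NULL,
    Price     decimal(10, 2) NOT NULL
);

-- 'Cold' table: wide, seldom referenced, keyed one-to-one to the hot table.
CREATE TABLE dbo.ProductDetails (
    ProductId     bigint        NOT NULL PRIMARY KEY
                  REFERENCES dbo.Product (ProductId),
    LongSpec      nvarchar(max) NULL,
    InternalNotes nvarchar(max) NULL
);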
I'm with Remus on all the "depends", but would just add that after choosing this design for a table/entity, you must also have a good process for determining what is "standard" and what is "details" for an entity.
Misplacing something as a detail which should be standard is probably the worst thing, because you can't require a row to exist as easily as you can require a column to exist (big complex trigger code). Setting a default on a type of row is a lot harder (big complex constraint code), and indexing is not easy either (a sparse index, maybe?).
Misplacing something as standard which should be a detail is less of a mistake: it just takes up extra row space and potentially cannot have a meaningful default.
If your details are very weakly structured, you could consider using an XML column for the "details" and still be able to query them using XPath/XQuery.
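As a rough illustration of that option (element, table, and column names invented), SQL Server's xml type can be queried with methods such as .value() and .exist():

SELECT EntityId,
       Details.value('(/details/color)[1]', 'nvarchar(50)') AS Color
FROM dbo.EntityMain
WHERE Details.exist('/details/color[text() = "red"]') = 1;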
As a general rule, I would not use this pattern for every entity table, but only entity tables which have certain requirements and usage patterns which fit this solution's benefits well.
Is your details table an entity value table? In that case, yes you are asking for performance problems.
What you are describing is an Entity-Attribute-Value design. They have their place in the world, but they should be avoided like the plague unless absolutely necessary. The analogy I always give is that they are like drugs: in small quantities and in select circumstances they can be beneficial. Too much will kill you. Their performance will be awful and will not scale and you will not get any sort of data integrity on the values since they are all stored as strings.
So, the short answer to your question: if you never need to query for specific values, never need a columnar report of a given entity's attributes, don't care about data integrity, and never do anything other than spit out the entire wad of data for an entity as a list, they're fine. If you actually need to use the data, however, whatever query you write will not be efficient.
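For context, a bare-bones EAV shape looks something like the following hypothetical sketch; note how every value lands in one string column, which is exactly where the typing and performance pain comes from:

CREATE TABLE dbo.EntityDetails (
    EntityId  bigint        NOT NULL,
    Attribute nvarchar(100) NOT NULL,
    Value     nvarchar(400) NULL,  -- everything is stored as a string
    CONSTRAINT PK_EntityDetails PRIMARY KEY (EntityId, Attribute)
);

-- Even a simple "numeric" comparison needs a cast, and an index on
-- Value is of little help to it.
SELECT EntityId
FROM dbo.EntityDetails
WHERE Attribute = N'Weight'
  AND TRY_CAST(Value AS decimal(10, 2)) > 100;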
You are mixing two different data models - a domain specific one for the "standard" and a key/value one for the "extended" information.
I dislike key/value tables except when absolutely required. They run counter to the concept of an SQL database and generally represent an attempt to shoehorn object data into a data store that can't conveniently handle it.
If some of the extended information is very often NULL you can split that column off into a separate table. But if you do this to two different columns, put them in separate tables, not the same table.

what type of database record id to use: long or guid?

In recent years I was using MSSQL databases, and all unique records in my tables had an ID column of type bigint (long). It is auto-incrementing and generally works fine.
Currently I am observing that people prefer to use GUIDs for a record's identity.
Does it make sense to swap bigint for guid as the unique record id?
I think it doesn't, since generating and sorting a bigint will always be faster than a guid, but... trouble comes when you run two (or more) separate instances of the application and database and have to keep them in sync: you then have to manage id pools between the SQL servers (for example, sql1 uses ids from 100 to 200, sql2 uses ids from 201 to 300), and that is thin ice.
With guid ids, you don't have to care about id pools.
What is your advice for my mirrored application (and db): stay with traditional IDs, or move to GUIDs?
Thanks in advance for your reply!
GUIDs have the following advantages:
You can create them offline from the database without worrying about collisions.
You're never going to run out of them.
And these disadvantages:
Sequential inserts can perform poorly (especially on clustered indexes); sequential GUIDs fix this.
They take up more space per row.
Creating one cleanly isn't cheap (though if the clients are generating them, this is actually no problem).
The column should still have a unique constraint (either as the PK or as a separate constraint if it is part of some other relationship) since there is nothing stopping someone supplying the GUID by hand and accidentally/deliberately violating uniqueness.
If the space doesn't bother you and your performance is not significantly impacted, they make a lot of problems go away. The decision is inevitably specific to the individual needs of the application.
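A small sketch of the sequential-GUID variant mentioned above, assuming SQL Server (table name invented):

-- NEWSEQUENTIALID() keeps new rows at the 'end' of the clustered index,
-- avoiding the page splits that random NEWID() values cause.
CREATE TABLE dbo.Orders (
    OrderId  uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_OrderId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,
    PlacedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);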
I use GUIDs in any scenario that involves either replication or client-side ID generation. It's just so much easier to manage identity in either of those situations with a GUID.
For two-tier scenarios like a web application talking directly to the database, or for servers that don't need to replicate (or perhaps only need to replicate in one direction, pub/sub style) then I think an auto-incrementing ID column is just fine.
As for whether to stay with autoincs or move to GUIDs ... it's one thing to advocate GUIDs in a green-field application where you can make all these decisions up front. It's another to advise someone to migrate an existing database. It might be more pain than it's worth.
GUIDs have issues with performance and concurrency when page splits occur. INTs can run a page fill of 100% because new rows are only ever added at one end; GUIDs insert everywhere, so you probably have to run a lower fill factor, which wastes space throughout the index.
GUIDs can be allocated in the application, so the app can know the ID of the record it will have created, which can be handy; but technically it is possible for duplicate GUIDs to be generated (long odds, but at least put a unique index on GUID columns).
I agree that merging databases is easier with GUIDs. But for me a straight INT is better, and I'd rather live with the hassle of sorting out how to merge DBs when/if that is actually needed.
If your data moves around often, then a GUID is the best choice for the key of the table.
If you really care about performance, just stick to int or bigint.
If you want to leverage both of the above, use int or bigint as the key of the table, and give each row a rowguid column so that the data can also be moved around easily without losing integrity.
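A sketch of that hybrid layout (names invented for illustration):

CREATE TABLE dbo.Items (
    ItemId  bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- fast internal key
    RowGuid uniqueidentifier     NOT NULL
        CONSTRAINT DF_Items_RowGuid DEFAULT NEWSEQUENTIALID()
        CONSTRAINT UQ_Items_RowGuid UNIQUE,  -- stable id when data moves between servers
    Payload nvarchar(400) NULL
);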
If the ids are going to be displayed in the query string, use GUIDs; otherwise, as a rule, use longs.

Reasons not to use an auto-incrementing number for a primary key

I'm currently working on someone else's database where the primary keys are generated via a lookup table which contains a list of table names and the last primary key used. A stored procedure increments this value and checks it is unique before returning it to the calling 'insert' SP.
What are the benefits for using a method like this (or just generating a GUID) instead of just using the Identity/Auto-number?
I'm not talking about primary keys that actually 'mean' something like ISBNs or product codes, just the unique identifiers.
Thanks.
An auto generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.
If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
There's nothing inherently wrong with using AutoNumber, but there are a few reasons not to do it. Still, rolling your own solution isn't the best idea, as dacracot mentioned. Let me explain.
The first reason not to use AutoNumber on each table is you may end up merging records from multiple tables. Say you have a Sales Order table and some other kind of order table, and you decide to pull out some common data and use multiple table inheritance. It's nice to have primary keys that are globally unique. This is similar to what bobwienholt said about merging databases, but it can happen within a database.
Second, other databases don't use this paradigm, and other paradigms such as Oracle's sequences are way better. Fortunately, it's possible to mimic Oracle sequences using SQL Server. One way to do this is to create a single AutoNumber table for your entire database, called MainSequence, or whatever. No other table in the database will use autonumber, but anyone that needs a primary key generated automatically will use MainSequence to get it. This way, you get all of the built in performance, locking, thread-safety, etc. that dacracot was talking about without having to build it yourself.
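A minimal sketch of such a central key table on SQL Server (names invented; on SQL Server 2012 and later a native CREATE SEQUENCE would do the same job):

CREATE TABLE dbo.MainSequence (
    NextId bigint NOT NULL
);
INSERT INTO dbo.MainSequence (NextId) VALUES (1);

-- Claim the next id atomically; the single-row UPDATE takes the
-- necessary locks for us.
DECLARE @claimed TABLE (Id bigint);

UPDATE dbo.MainSequence
SET NextId = NextId + 1
OUTPUT deleted.NextId INTO @claimed;

SELECT Id FROM @claimed;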
Another option is using GUIDs for primary keys, but I don't recommend that because even if you are sure a human (even a developer) is never going to read them, someone probably will, and it's hard. And more importantly, things implicitly cast to ints very easily in T-SQL but can have a lot of trouble implicitly casting to a GUID. Basically, they are inconvenient.
In building a new system, I'd recommend using a dedicated table for primary key generation (just like Oracle sequences). For an existing database, I wouldn't go out of my way to change it.
from CodingHorror:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
The article provides a lot of good external links on making the decision on GUID vs. Auto Increment. If I can, I go with GUID.
It's useful for clients to be able to pre-allocate a whole bunch of IDs to do a bulk insert without having to then update their local objects with the inserted IDs. Then there's the whole replication issue, as mentioned by Galwegian.
The procedure method of incrementing must be thread safe. If not, you may not get unique numbers. Also, it must be fast, otherwise it will be an application bottleneck. The built in functions have already taken these two factors into account.
My main issue with auto-incrementing keys is that they lack any meaning
That's a requirement of a primary key, in my mind -- to have no other reason to exist other than identifying a record. If it has no real-world meaning, then it has no real-world reason to change. You don't want primary keys to change, generally speaking, because you have to search-replace your whole database or worse. I have been surprised at the sorts of things I have assumed would be unique and unchanging that have not turned out to be years later.
Here's the thing with auto-incrementing integers as keys:
You HAVE to have posted the record before you get access to its id. That means that until you have posted the record you cannot, for example, prepare related records that will be stored in another table, or do any of the many other things for which it is useful to know the new record's unique id before posting it.
The above is my deciding factor in whether to go with one method or the other.
Using unique identifiers would allow you to merge data from two different databases.
Maybe you have an application that collects data in multiple database and then "syncs" with a master database at various times in the day. You wouldn't have to worry about primary key collisions in this scenario.
Or, possibly, you might want to know what a record's ID will be before you actually create it.
One benefit is that it can allow the database/SQL to be more cross-platform. The SQL can be exactly the same on SQL Server, Oracle, etc...
The only reason I can think of is that the code was written before sequences were invented and the code forgot to catch up ;)
I would prefer to use a GUID for most of the scenarios in which the post's current method makes any sense to me (replication being a possible one). If replication was the issue, such a stored procedure would have to be aware of the other server which would have to be linked to ensure key uniqueness, which would make it very brittle and probably a poor way of doing this.
One situation where I use integer primary keys that are NOT auto-incrementing identities is the case of rarely-changed lookup tables that enforce foreign key constraints, that will have a corresponding enum in the data-consuming application. In that scenario, I want to ensure the enum mapping will be correct between development and deployment, especially if there will be multiple prod servers.
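For instance, a hedged sketch of that pattern (the status values are invented): the ids are inserted explicitly by a deployment script so they always line up with the enum in the application code:

CREATE TABLE dbo.OrderStatus (
    OrderStatusId int          NOT NULL PRIMARY KEY,  -- deliberately not IDENTITY
    Name          nvarchar(50) NOT NULL UNIQUE
);

-- Ids chosen to match the application enum exactly, on every server.
INSERT INTO dbo.OrderStatus (OrderStatusId, Name) VALUES
    (1, N'Pending'),
    (2, N'Shipped'),
    (3, N'Cancelled');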
Another potential reason is that you deliberately want random keys. This can be desirable if, say, you don't want nosey browsers leafing through every item you have in the database, but it's not critical enough to warrant actual authentication security measures.
My main issue with auto-incrementing keys is that they lack any meaning.
For tables where certain fields provide uniqueness (whether alone or in combination with another), I'd opt for using that instead.
A useful side benefit of using a GUID primary key instead of an auto-incrementing one is that you can assign the PK value for a new row on the client side (in fact you have to do this in a replication scenario), sparing you the hassle of retrieving the PK of the row you just added on the server.
One of the downsides of a GUID PK is that joins on a GUID field are slower (unless this has changed recently). Another upside of using GUIDs is that it's fun to try and explain to a non-technical manager why a GUID collision is rather unlikely.
Galwegian's answer is not necessarily true.
With MySQL you can set a key offset for each database instance. If you combine this with a large enough increment, it will work fine. I'm sure other vendors have some sort of similar setting.
Let's say we have two databases we want to replicate. We can set them up in the following way:
increment = 2
db1 - offset = 1
db2 - offset = 2
This means that
db1 will have keys 1, 3, 5, 7...
db2 will have keys 2, 4, 6, 8...
Therefore we will not have key clashes on inserts.
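In MySQL these settings are the auto_increment_increment and auto_increment_offset system variables; a sketch of the configuration described above:

-- On db1:
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 1;

-- On db2:
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 2;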
The only real reason to do this is to be database agnostic (if different db versions use different auto-numbering techniques).
The other issue mentioned here is the ability to create records in multiple places (like in the central office as well as on traveling users' laptops). In that case, though, you would probably need something like a "sitecode" that was unique to each install that was prefixed to each ID.
