Too many columns in a single preference db table?

Too many columns in a single preference db table? - database

I have an application that is essentially built out of many smaller applications. Each application has their own individual preferences, but all of them share the same 5 preferences, for example, whether the application is displayed in the nav, whether it is public, whether reports should be generated, etc.
All of these common preferences need to be known by any page in the web app because the navigation is constructed from it. So originally I put all these preferences in a single table. However as the number of applications grow (10 now, eventually around 30), the number of columns will end up being around 150-200 total. Most of these columns are just booleans, but it still worries me having that many columns in one table. On the other hand, if I were to split them apart into separate tables (preferences per app), I'd have to join them all together anyway every time I need to see the preferences, so why not just leave them all together?
In the application I can break the preferences into smaller objects so they are easier to work with, but from a db perspective they are a single entity. Is it better to leave them in one giant table, or break them apart into smaller ones but force many joins every time they are requested?

Which database engine are you using ? normally you will find some recommendations about recommended number of columns per table in your DB engine. Mostly Row size limitations, which should keep you safe.
Other options and suggestions include:
Assign a bit per config key in an integer, and use the logical "AND" operation to show only the key you are interested in at a given point in time. Single value read from DB, one quick Logical operation for each read of a config key.
Caching the preferences in memory, less round trips to DB servers, Based on frequency of changes , you may also having to clear the cache of each preference when it is updated.

Why not turn the columns into rows and use something like this:
This is a typical approach for maintaining lists of settings values.
The APP_SETTING table contains the value of the setting. The SETTING table gives you the context of what the setting is.
There are ways of extending this to add information such as which settings apply to which applications and whether or not the possible values for a particular setting are constrained to a specific list.

Well CommonPreferences and ApplicationPreferences would certainly make sense, and perhaps even segregating them in code (two queries instead of a join).
After that a table per application will make more sense.
Another way is going down the route suggested By Joel Brown.
A third would be instead of having individual colums or row per setting, you stuff all the non-common ones in to an xml snippet or serialise from a preferences class.
Which decision you make revolves around how your application does (or could use the data).
If you go down the settings table approach getting application settings as a row will be 'erm painful. Go down the xml snippet route and querying for a setting across applications will be even more painful than several joins.
No way to say what you should compromise on from here. I think I'd go for CommonPreferences first and see where I was at after that.

Related

System design: whether to normalize the departments or not

I'm working with two consultants in one project. The thing is we reached a point where both of them cannot get into an agreement and each offer a different approach.
The thing is we have a store with four departments and we want to find the best approach for working with all of them in the same database.
Each department sell different products: Cars, Boats, Jetskies and Motorbikes.
When the data is inserted or updated in each department there are some triggers to be fires so different workflows will begin, when adding a new car there are certain requirements that needs to be checked as well as the details of the car that are completely different than a boat. Also, regarding the data there are not many fields there are in common, I would say so far only the brand, color, model and year, everything else is specific for each deparment due to the different products and how they work with them..
Consultant one says:
Create one table for all the departments and use a column to identify what department the row belongs to, this way you will have only one trigger and inside the trigger you will then call the function/mehod you need for each record type.
Reason: you only have one table (with over 200 fields) and one trigger, is easier to maintain. Also if you need to report you just need to query one table and filter based on the record type. If you need to report for all the items you don't need to have multiple joins.
Consultant two says:
Create one table for each deparment and a trigger for each table.
Reason: you will have smaller tables (aprox 50 fields each) and is more flexible and you have it all separated. If you want to report you need to join the tables as you want to include data from different places.
I see the advantages of having everything in one place but if I want to expand or change anything I have the feeling I will bre creating a beast table as the data grows.
On the other side keep it separated look more appealing but will need to setup everything for each different table.
What would you say is the best approach?

You should probably listen to consultant number two.
The thing is, all design is trade-offs. You need to assess the pros and cons of each approach and you need to think about the risks that each design entails.
What happens when your design grows? (department 5, more details per product type,...)
What happens when the system scales up to higher transaction volumes?
What happens when your business rules change?
I've been doing this for a long time and I've seen some pendulums swing back and forth when it comes to what is "in fashion" as far as database and software best practices.
I'd say right now the prevailing wisdom is that separation of concerns is innately good. This means you should keep your program logic (trigger code) separate for each department. This makes sense because your logic will vary from one product type to the next since they mostly have distinct columns.
This second point is also important, because your stake in the ground for a transactional system should always be start with third normal form (or higher, if necessary). Sometimes you can get away without it, but four different types of objects with 40 or more distinct attributes each doesn't sound like a good candidate for jamming everything into one table. How do you keep track of which columns belong to which type of product, for example? A separate table for each product type keeps this clean and simple - and importantly - easy for your support programmers to understand.
Contrary to what consultant one is saying, having one trigger instead of four is not likely to be easier to maintain if that one trigger is a big bowl of spaghetti, or even four tidy, well written subroutines joined together with a switch type statement.
These days, programmers favour short, atomic, single-purpose functions (triggers, in your case).
If there is enough common data and common business logic that doing it four times seems awkward, then maybe you have a good candidate for a super-type / sub-type design.

I'll say one
These are all Products, It doesn't matter that its a Bike or a Car. You can control the fields and the object by RecordTypes and Page layouts and that will save you from having 4 Objects, which means potentially 8 new classes(if it follows my pattern it could be up to 20+) + all of the workflow rules and validation rules across the these new objects, it will be very hard to maintain a structure that has 4 objects but are all the same thing.. Tracking Products.
Down the road if you decide to add a new product such as planes, it will be very easy to add a plane to this object and the code will be able to pick up from there if needed. You will definitely need Record Types to manage each Product. The trigger code shouldn't be an issue if the consultants are building it properly meaning a trigger should never have any business logic so as long as that is followed all of the code will be maintainable

I will go with one.
I assume you have a large number of products and this list will grow in future. All these are Products at the end. They will have some common fields and common logic.
If you use Process Builder with Invocable classes instead of Triggers, you may be able to get away with just configuration changes while adding a new object, if its fields and functionality are same/similar to a existing object.
There may also be limitation on the number of different objects a profile has access to based on your license types.
Salesforce has a standard object called Product. Its a single object to be classifies based on record type.
I would have gone with approach two if this was not salesforce. Based on how salesforce works and the limitations it imposes one seems like a better and cleaner solution.

I would say option 2.
Why?
(1) I would find one table with 200+ columns harder to maintain. You're also then going to have to expose fields for an object that doesn't need said fields.
(2) You are also going to have to "hide" logic inside the trigger which then decides to do different actions based on the type of department etc...
(3) Option 2 involves more "scaffolding" and separate objects but those are objects are inherently smaller and easier to maintain and don't specifically hide logic or cause any sort of ambiguity.
(4) Option 2 abides by the single responsibility principle. Not everyone follows this I understand but I find it a good guiding principle, as the responsibility for the data lies with the individual table and the responsibility for triggered the action lies with the individual trigger as opposed to just being one mammoth entity/trigger.
** I would state that I am simply looking at this from a software development perspective, I am not sure whether or not SalesForce would handle this setup, but it is the way I would personally prefer to design it. :)

Option 2 for me.
You've said that there is little common data and the trigger logic is completely different. Here are some additional technical considerations.
Option 1 Warnings
The trigger would be a single point of failure and errors will be trickier to debug. I have worked with large triggers where broken logic near the top has stopped logic near the bottom from running, sometimes silently! You also have to maintain conditional guards to control the flow of logic based on the data which is another opportunity for error.
I'm not red hot on indexes but I believe performance will suffer due to no natural order of the multi-purpose data. More specific tables will yield better indexing strategies. Also, large rows can lead to fragmented indexes.
https://blogs.msdn.microsoft.com/pamitt/2010/12/23/notes-sql-server-index-fragmentation-types-and-solutions/
You would need extra consideration when setting nullable/default constraints on each surplus field not relevant to the product in question. These subtleties can introduce bugs and might make it harder if/when you decide to work with a data layer technology such as Entity Framework. E.g. the logical difference between NULL, 0 and 'None', especially on shared columns.

When is it appropriate to use UUIDs for a web project?

I'm busy with the database design of a new project, and I'm not sure whether to use UUIDs or normal table-unique auto-increment ids.
Up to now, the sites I've built have all run on a single server, and very heavy traffic has never been too much of a concern. However, this web application will eventually run concurrently on multiple servers, serve an API, and need to process thousands of requests per second, and I want to make sure that the design I choose now doesn't cripple any of those possibilities later.
I have my suspicions, of course, and they should be clear through the way I phrased my question, but I would like to hear from those with more experience what trouble I can run into later if I do or don't have UUIDs, and what I should really be basing my decision on.
So, in short: What are the considerations I should give into deciding whether or not to use UUIDs for all database models, so that any one object can be identified uniquely by one string, and when is it appropriate to use this as the primary key, instead of table-by-table auto-increment?
Note: I've seen this question (When are you truly forced to use UUID as part of the design?), and read all the answers, but they mostly answer "How rarely do UUIDs collide", instead of "When is it appropriate to use them".

One consideration that I've used when deciding on UUIDs vs. auto-increment ids is whether they're going to be user-visible, and if so, whether I want users to know how many I have of that table. For example, if I didn't want to make public the number of registered users my site has, I wouldn't assign auto-increment user ids.
And to address one other specific point you raised, it's still possible to use auto-incrementing ids with multiple servers (though not with the built-in MySQL). You just need to start all the ids at different offsets, and increment accordingly. That is, if you had 3 servers, you could start server A at 1, server B at 2, and server C at 3, and then increment the ids by 10 each time instead of 1. That way, you could guarantee no collisions.
And finally, the last thing I consider is how important performance is to my application. Integers are much more easily indexed than UUIDs that are string-based, so indexes are smaller, more quickly searched, etc.

UUID's or GUID's can be very useful especially for the web. If you use auto-increment values to store UserId anyone can view the source of your web pages and see the simplicity of it's use. They could then try any integer value to get data they are not supposed to see.
GUID's are not created in any sequential format, therefore if you create them one right after the other, there sequence can not easily be guessed.
I don't think it's necessary to use GUID's for simple lookup type data such as ColorId 1=Blue, 2=Red, 3=Green.
GUID's are also very useful for session and state management.
That's my $0.02

drawbacks of storing all ''things' in a central table

I am not sure if there is a term to describe this, but I have observed that content management systems store all kinds of data in a single table with their bare minimum properties while the meta data is stored in another table in form of key value pairs.
for eg. everything (blog posts, pages, images, events etc) is stored in one table and considered as a post.
I understand that this allows for abstraction and easy extensibility
we are considering designing our new project this way. It is not exactly a CMS but we plan to keep adding modules to it in stages. Lets say initially there will be only posts and images on which comments can be posted. Later on we might add videos which will also have the commenting feature.
what are the drawbacks of this approach ? and will it work for a requirement like ours ?
Thanks

The drawback is that the main table will get zillions of reads (and plenty of writes, too).
This means that there will be lots of lock contentions, heavy reindexing etc.
In order to mitigate this a bit you may consider splitting the "main table" in a series of not-so-main-tables.
Say, you will have one main table for "Posts" (possibly refined through metadata or subtables for specific types of posts, like Sticky, Announcement, Shoutbox, Private...)
One main table for Images (possibly refined for gifs, jpegs etc.)
One main table for Videos...
If this is a custom application (and not intended to be something that has to be "infinitely tweakable" like a CMS or a Portal framework) I think this kind of split is acceptable, and may provide some better performance (if you expect to have large amounts of data).
Regarding your "examples" comment... first of all, if you keep comments again in a single gigantic table you may have similar problems as if you kept all type of items in it.
Assuming this is not a problem, you can obviously put a sort of reference key (you can't use the normal foreign keys, of course) that links comments to their original item.
This works fine when you go from item to comments, a bit less when you have to move from comments to the originating item. So the tradeoff is about what kind of operations would be more frequent for your problem.

Simplicity and extensibility are indeed often attractive aspects of attribute-value and (as you say) "single table of things" approaches.
There's no 100% right answer here -- depending on your performance/throughput goals and extensibility needs, this approach might work for you too.
In most cases, however, where you know what kinds of data you will store, it's usually in your interest to model distinct entities into their own tables and relate the data accordingly. RDBMSes have been architected and refined over decades to cater to this use case and to simply use tables as generic dumping grounds doesn't typically buy you any distinct advantages, except the act of delaying the inevitable need to model your data properly. Furthermore, when you boil everything into one table, you then force users outside your app itself (if you have any, for example report writers) to have to struggle with your "model within a model", which can just make folks frustrated when they write queries, etc. And you will sink to your lowest common denominator -- if you want to optimize queries about type X and you have types Y and Z in that same table in droves, they will impact performance on querying X.
Again, to be clear, there is distinct benefit to the "all things in one table" name/value style metadata approaches. I have used them myself and turned against modeling for similar reasons. However, my advice is to limit yourself to times when you really need to do that (i.e., you need to implement something before you can correctly model the space of things you will need). Most typically, I find myself doing that when I'm prototyping complex systems and I need to get something going sooner than later.

Database Design Questions - Need Clarifications

i m designing a database using sql server 2005
main concept of our side is to import xml feeds from suppliers
different supplier can have different representation of data
the problem is i need to design table to store imported information
some of the columns are fixed means all supplier products must have similar data coming from the feed like , name, code, price, status, etc
but some product have optional details like
one product have might color property other might dont.
what is the best way to store these kind of scenario into the database.
should i create a table for mandatory columns and other tables to hold optional column.
or i should i list down all the column first and put them into the one table. (there might a lot of null values)
there will thousands of products and database speed is very essential .
we will be doing a lot of product comparison from different supplier
our database will be something like www.pricerunner.co.uk
i hope i explain the concept well

Thousands of products (so thousands of rows.) Thats really not many at all, so you could normalize the the optional data to a few separate tables without having a dramatic effect on query time.
I would say put your indexes in the correct place, optimize your queries, make sure you have filegroups split up nicely, etc (just the usual regular old database stuff) and you should be good.

Depends on how you want to access it.
As you say, speed is important - but what are you going t do with those extra, optional, bits of information? Do you need to store them at all? Assuming you do, how often do you need to access them?
Essentially, if you will always need to at least check if they're there, probably better to put them into one table. If you need to check anyway, might as well get it over with as part of the initial query.
If, on the other hand, you can usually run without bothering to check for these extra pieces, and only need to bother when specilly requested, then it might be better to put them into a different table. The join (or subsequent lookup) will be expensive - much more expensive than pulling nulls for empty columns - but if it's very infrequent, would probably cost less in runtime execution in the long run.
Also bear in mind the tradeoff in storage and transport terms - storing lots of empty fields does take some space, and sending back lots of empty fields takes network bandwidth.
If disk space is not a concern, but bandwidth is, make the application is carfully designed to minimse unecessary lookups, and then with tight queries you can store the extra (optional) data, but not pass it back unless it's requested.
So, it really all depends on what's important to you. Once you know what your overriding design concerns are, you will know which compromises to make to address those concerns at the expense of others. A balancing act.

Default database IDs; system and user values

As part of our current database work, we are looking at a dealing with the process of updating databases.
A point which has been brought up recurrently, is that of dealing with system vs. user values; in our project user and system vals are stored together. For example...
We have a list of templates.
1, <system template>
2, <system template>
3, <system template>
These are mapped in the app to an enum (1, 2, 3)
Then a user comes in and adds...
4, <user template>
...and...
5, <user template>
Then.. we issue an upgrade.. and insert as part of our upgrade scripts...
<new id> [6], <new system template>
THEN!!... we find a bug in the new system template and need to update it... The problem is how? We cannot update record using ID6 (as we may have inserted it as 9, or 999, so we have to identify the record using some other mechanism)
So, we've come to two possible solutions for this.
In the red corner (speed)....
We simply start user Ids at 5000 (or some other value) and test data at 10000 (or some other value). This would allow us to make modifications to system values and test them up to the lower limit of the next ID range.
Advantage...Quick and easy to implement,
Disadvantage... could run out of values if we don't choose a big enough range!
In the blue corner (scalability)...
We store, system and user data separately, use GUIDs as Ids and merge the two lists using a view.
Advantage...Scalable..No limits w/regard to DB size.
Disadvantage.. More complicated to implement. (many to one updatable views etc.)
I plump squarely for the first option, but looking for some ammo to back me up!
Does anyone have any thoughts on these approaches, or even one(s) that we've missed?

I have never had problems (performance or development - TDD & unit testing included) using GUIDs as the ID for my databases, and I've worked on some pretty big ones. Have a look here, here and here if you want to find out more about using GUIDs (and the potential GOTCHAS involved) as your primary keys - but I can't recommend it highly enough since moving data around safely and DB synchronisation becomes as easy as brushing your teeth in the morning :-)
For your question above, I would either recommend a third column (if possible) that indicates whether or not the template is user or system based, or you can at the very least generate GUIDs for system templates as you insert them and keep a list of those on hand, so that if you need to update the template, you can just target that same GUID in your DEV, UAT and /or PRODUCTION databases without fear of overwriting other templates. The third column would come in handy though for selecting all system or user templates at will, without the need to seperate them into two tables (this is overkill IMHO).
I hope that helps,
Rob G

I recommend using the second with the modification that you store the system and user values in one table. GUID is quite reliable in this manner.
Another idea: use any text-based ID (not necessary GUID), which you give for the system values and is generated by a random string or a string based on some kind of custom logic for the user values.
Another idea: use the first approach, but extend the table with a flag which shows if a value is system or user. Maybe this is the easiest. Ok, you have to write some kind of mechanism to update the correct system value, but it can be done easily.

+1 for Biri's text based ID - define a "template_mnemonic" text based column and make it the primary key. This will be a known value when you insert it as you, the developers will have decided on it (or auto-generated it) and you will always be able to reference a template by its mnemonic regardless of how many user specified templates there are. It also allows users to have a meaningful naming convention for their templates.

Maybe I didn't get it, but couldn't you use GUIDs as Ids and still have user and system data together? Then you can access the system data by the (non-changable) GUIDs.

I don't think that GUID should make any problem.
If you want to avoid it, then use a flag:
ID int
template whatever
flag enum/int/bool
Flag shows whether the actual value is a system or a user value.
If you would like to update a system value, then ask only for system values ordered by ID, and it will show you actual order of insertion (you should have a bigint or something for ID to make sure that it doesn't get full and it doesn't get the deleted IDs back to work). With this list the x. record is the x. inserted system value.

I think there is a better third solution.
It strikes me that you're storing two different things in the same table and that you might be better off creating 2 separate tables one for user templates and one for system templates. You might then be able to create a view over the two tables to make them appear as a single object to your application.
Obviously I don't have full knowledge of your application and this may be impossible for you for any number of reasons but I think it's a neater solution than GUIDs and way safer than ranges of IDs (seriously don't do ID ranges it'll bite you one day)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight