Should i search content on database by id or name? - database

I have a CMS that has two methods to query contents. One that queries by id and another one queries by the name of the content.
ContentManager.Select(12);
or
ContentManager.Select("Content Name");
The way I see the first one would be faster, because the id is an index and doesn't involve string comparison. While the second one is much easier to work with.
I have worked, for maintenance reasons, with the second one. But if i change the content name, the Select obviously is not going to work anymore. But the Id is supposed to be only o database level, and not visible from the CMS forms.
Edit:Also, if a content were to be deleted and reinserted the string select would work and the id select wouldn't.
I can't come to a common point between these two approaches.

Selecting by the primary key gives best possible performance, but that's not always your only motivation. You might be able to add an index to the content name column, depending on it's width and your read/write ration (and depending on how much control over the database you have, I suppose).
Verdict, if you have the id, select by the id, if you don't and it's not ruining your performance, don't sweat using the content name.

Depends on which one is indexed... So yes you are right, in this case use the ID... If there is a need to also search by name, add another index using the name..

IDs ususally work best in databases. However, you are at the mercy of the CMS, it might be storing both those in an array and using the same exact select statement. Who knows? view the source code and see what is going on.
whatever you do, stick to one style in all of your code.

Related

CakePHP virtual HABTM table/relation, something like fixture

first of all I'd like to tell you that you're terrific audience.
I'm making an application where I have model Foo with table Foos. And I'd like to give Foo another parameter, HABTM parameter, lets say Bar. But I'd rather don't create table for Bar. Because Bar will have like 5 positions on start and in 5 years it will grow to maybe 7 positions or not at all. So I don't see a need to create another table and make CakePHP look into that table with another SELECT. Anyone have an idea this can be achieved ?
One solution I think is making an fixture for Bars table and adding only Bars_Foos table for real (it won't be big anyway). But I can't find a way to use test fixtures in normal Controller
Second solution is to save a JSON or serialized array in Foo one field and move logic to model, but I don't know if it is best solution. Something like virtual field.
Real life example:
So I have like Bikes. And every Bike have its main_type. Which is for now {"MTB","Road","Trekking","City","Downhill"}. I know that in long time this list would not grow much. Maybe 2 or 5 positions in few years. Still it will relatively short.
(For those who say that there maybe a hundred of specialized bike types. I have another parameter column specialized_type)
It needs to be a HABTM relation, but main_types table will be very small, so Id like to avoid creating it and find a way for simpler solution.
Because
It bothers MySQL for such small amount of data
It complicates MySQL queries
I have to make additional model for MainType
I have more models to unbind when I don't need most of data and would like use recursive
Insert here anything you'd like...
Judging from your real life example, I'd say you're on the wrong track. The queries won't be complicated, CakePHP uses additional queries for HABTM relations, it would be just one additional query which shouldn't be very costly, also it's very easy to sparse it out by using the containable behaviour. And if you really need to use recursive only (for whatever reason), then it's just one single additional model to unbind, that doesn't seem like overkill to me.
This might not be what you wanted to hear, but I really think a proper database solution is better than trying to hack in "virtual data". Also note that fixtures as used in tests, only define data which is written to the database on the fly when running the test, so that would be definitely more costly than using data that already exists in the database.
Maybe you'll get a small performance boost for selects that do not query the main type when using an additional column to store the data, but you'll definitely lose all the flexibility that the RDBMS has to offer, including faster selects using proper indexing, affecting multiple records by updating a single related value, etc. That doesn't sound like a good trade-off to me. Think about it, how would you select all Downhill Tracking bikes when this information is stored as a string in a single column? You would probably end up using ugly LIKE selects.
Now wait, there's a SET data type in MySQL hat can hold multiple values. Right, and it looks easier and less complex. Right, but in the background it isn't, while using a complex looking join-query can be pretty fast using proper indexing, the query for the SET type will have to scan every single row since the data stored in the column cannot be indexed appropriately in order to make more specific selects.
In the end it probably depends on your data, so I'd suggest testing both methods in your specific environment and see how they compare under workload.

Is it ever a good idea to have a record in a reference table in your database that represent "all other records"?

I have an asp.net-mvc website with a SQL Server backend. I am simplifying my situation to highlight and isolate the issue. I have 3 tables in the DB
Article table (id, name, content)
Location table (id, name)
ArticleLocation table (id, article Id, location Id)
On my website, when you create an article, you select from a multiselect listbox the locations where you want that article sent.
There are about 25 locations so I was debating adding a new location called "Global" as a shortcut instead of having the person select 25 different items from a listbox. I could still do this as a shortcut on the front end but now I am debating if there is benefit for this to flow through to the backend.
So if I have an article that goes global, instead of having 25 records in the ArticleLocation table, I would only have one and then I would do some tricks on the front end to select all of the items. I am trying to figure out if this is a very bad idea.
Things I can think about that are making me nervous:
what if I create an article and choose global but then last in the future 3 new locations are added. Without this global setting, these 3 location would not get the article but in the new way, they would. I am not sure what is better as the second thing might actually be what you want but its a little less explicit.
I have a requirement on a report, I want to filter by all articles that are global. Imagine I would need a article.IsGlobal() methode. Right now I guess I could say if a project has the same count of locations as all of the records in the location table I could translate that to being deemed global but again since people can add new locations, I feel like this approach is somewhat flaky.
Does anyone have any suggestions for this dilemna around creating records in a reference data table that really reflect "all records". Appreciate any advice
By request, here is my comment promoted to an answer. It's an opportunity to expand on it, too.
I'll limit my answer to a system with a single list of locations. I've done the corporate hierarchy thing: Companies, Divisions, Regions, States, Counties, Offices and employees or some such. It gets ugly.
In the case of the OP's question, it seems that adding an AllLocations bit to the Articles table makes the intention clear. Any article with the flag set to 1 appears in all locations, regardless of when they were created, and need not have any entries in the ArticleLocation table. An article can still be explicitly added to all existing locations if the author does not want it to automatically appear in future locations.
Implementation involves a little more work. I would add INSERT and UPDATE triggers to the Article and ArticleLocation tables to enforce the rule that either the AllLocations bit is set and there are no corresponding rows in ArticleLocation, or the bit is clear and locations may be explicitly set. (It's a personal preference to have the database defend itself against "bad data" whenever it's practical to do so.)
Depending on your needs, a table-valued function is a good way to hide some of the dirty work, e.g. dbo.GetArticleIdsForLocation( LocationId ) can handle the AllLocations flag internally. You can use it in stored procedures and ad-hoc queries to JOIN with Article. In other cases a view may be appropriate.
Another feature that you are welcome to borrow ("Steal from your friends!") is to have the administrator's landing page be an "exceptions" page. It's a place where I display things that vary from massive flaming disasters to mere peccadillos. In this case, articles that are associated with zero locations would qualify as something non-critical, but worth checking up on.
Articles that are explicitly shown in every location might be of interest to someone adding a new location, so I would probably have a web page for that. It may be that some of the articles should be updated to account for the new location explicitly or reconsidered for being changed to all locations.
Is it ever a good idea ... that represent “all other records”?
Is it it ever a good idea to represent a tree in table? Root of a tree represents “all other records”.
Trees and hierarchies are not simple to work with, but there are many examples, articles and books that tackle the problem -- like Celko's Trees and Hierarchies in SQL; Karwin's SQL Antipatterns.
So what you actually have here is a hierarchy (maybe just a tree) -- it may help to approach the problem that way from the start. The Global from your example is just another Location (root of a tree), so when a new location is added, you may decide if it will be a child of the Global or not.
Facts
Location(LocationID) exists.
Location(LocationID) is contained in Parent Location(LocationID).
Article(ArticleID) exists.
Article(ArticleID) is available at Location(LocationID).
Constraints
Each Location is contained in at most one Parent Location. It is possible that for some Parent Location, more than one Location is contained in that Parent Location.
It is possible that some Article is available at more than one Location
and that for some Location, more than one Article is available at that Location.
Logical
This way you can assign any location to an article -- but have to resolve it to the leaf level when needed.
The hierarchy (tree) is here represented in the "naive way"; use closure table, nested sets or enumerated path instead -- or, if you like recursion...
tl;dr
In this case as I understand it, I think it is a good idea to create a "global" location in the Location table. I definitely find it preferable to creating a "global" flag in the Article table.
"Is it ever a good idea...?" is not a question we like to answer on SO. It's mostly a debate question, not a Q&A question, and besides, we have enough creativity in our community to come up with some example where "it" would be a good idea, regardless.
To your more specific question, how do I represent "all locations" in the database? that is a judgement call based on your business requirements.
Do you want "all locations" to include future locations?
If not, then probably you should only implement "all locations" as a helper that selects all current locations in the database.
Do you anticipate having a hierarchy of locations?
Real-world locations have significant hierarchy:
Global
Multi-national (continent, trading block)
Country
Administrative region (state, province, canton, etc.)
City
Neighborhood
If you think you are going to want to have the option to choose, say, a Country, instead of Global, then implementing a hierarchical representation such as Damir suggests is the best way to go. However, if you are not sure if you are ever going to have any other grouping of locations besides Global, a hierarchical data structure is too much work for now. All you need to do is make sure your current implementation has a migration path to a possible future hierarchical representation.
Global as a pseudo-location
If you do want future locations included in Global and do not need a hierarchical location structure, then my instinct based on years of experience would be to create "Global" as a pseudo-location. That is, Global would be one of the locations in the Location table, but it would have a special meaning. This is definitely a trade-off, but has the benefit of not altering the data structure to support Global which means that all the special cases that "Global" creates are handled by excluding or including some Locations in queries rather than by checking some flags somewhere. (Or if you like flags, you can add a 'pseudo-location' flag to the Location table.)
With Global as a location, additions or deletions to the Location table are handled automatically. The query for all Global articles is straightforward: the same as the query for all articles for any other Location. Reporting on articles by location is also straight forward, with Global articles appearing in reports just like any other location. You can also represent the difference between a "Global" article (all current and future locations) and an "all locations" article (all current locations but no future locations).
Selecting all articles that should be visible at a specific location is slightly harder, it's now a check against "Global" as well as that location, but at least it is checking for 2 values in the same table versus checking two different tables.
SELECT article_id FROM ArticleLocation WHERE location_id in (1, 5);
vs
SELECT article_id FROM ArticleLocation WHERE location_id = 5
UNION
SELECT id FROM Article WHERE is_global;
From the logic, as you described it, GLOBAL should be actually global and stay global, even if you add new locations (problem 1 solved). But this also implies that GLOBAL is not the same thing as "all locations" (as there might also exist some other locations we don't have defined yet). I think this logic is needed especially by your requirement 2 - otherwise it would completely fail on adding new locations.
Analysis done! From the above we see that GLOBAL is something above all those locations. There's no sense in trying to define it as a Location. Go for the easiest solution!
Article table (id, name, content, global)
i.e. boolean flag - article is global or not. In the UI, do it simply as a checkbox - if checked, the multiselect box will be disabled. Simple, easy, requirements met. Done!
Is there a need to automatically add some articles to new locations when new locations are added? If yes then in such case I’d consider adding new ‘global’ property in the backend.
Otherwise it probably isn’t worth the effort. Even if you had 10000 articles and 20 different locations selected for each article that would be about 200k records which is not that bad when you set indexes.
Check your existing data and see how people are already choosing locations. If most users select only several locations and not all then it’s really an edge case and you shouldn’t be working on it unless it really creates problems.
I agree with #HABO's comment (he should have posted it--if he does, upvote him). Adding an atrribute to table Article to identify those items that are to be associated with all Locations, present and future, presumably for the lifetime of the article, should save you time and effort over the long run. Sure, triggers and counts-against-all will do the trick, but they're awkward and would be a pain to support if/when subsequent system changes come along. The UI would be simpler to use, as the user just has to click a checkbox (or whatever) and not multi-click everything in a dropdown of unforseeable length.
(#Damir's hierarchy idea would work as well, but--speaking from a bit too much experience--they're a hassle to work with, and I wouldn't introduce one here unless there was significantly more system and/or business use to get out of it.)

Too many columns in a single preference db table?

I have an application that is essentially built out of many smaller applications. Each application has their own individual preferences, but all of them share the same 5 preferences, for example, whether the application is displayed in the nav, whether it is public, whether reports should be generated, etc.
All of these common preferences need to be known by any page in the web app because the navigation is constructed from it. So originally I put all these preferences in a single table. However as the number of applications grow (10 now, eventually around 30), the number of columns will end up being around 150-200 total. Most of these columns are just booleans, but it still worries me having that many columns in one table. On the other hand, if I were to split them apart into separate tables (preferences per app), I'd have to join them all together anyway every time I need to see the preferences, so why not just leave them all together?
In the application I can break the preferences into smaller objects so they are easier to work with, but from a db perspective they are a single entity. Is it better to leave them in one giant table, or break them apart into smaller ones but force many joins every time they are requested?
Which database engine are you using ? normally you will find some recommendations about recommended number of columns per table in your DB engine. Mostly Row size limitations, which should keep you safe.
Other options and suggestions include:
Assign a bit per config key in an integer, and use the logical "AND" operation to show only the key you are interested in at a given point in time. Single value read from DB, one quick Logical operation for each read of a config key.
Caching the preferences in memory, less round trips to DB servers, Based on frequency of changes , you may also having to clear the cache of each preference when it is updated.
Why not turn the columns into rows and use something like this:
This is a typical approach for maintaining lists of settings values.
The APP_SETTING table contains the value of the setting. The SETTING table gives you the context of what the setting is.
There are ways of extending this to add information such as which settings apply to which applications and whether or not the possible values for a particular setting are constrained to a specific list.
Well CommonPreferences and ApplicationPreferences would certainly make sense, and perhaps even segregating them in code (two queries instead of a join).
After that a table per application will make more sense.
Another way is going down the route suggested By Joel Brown.
A third would be instead of having individual colums or row per setting, you stuff all the non-common ones in to an xml snippet or serialise from a preferences class.
Which decision you make revolves around how your application does (or could use the data).
If you go down the settings table approach getting application settings as a row will be 'erm painful. Go down the xml snippet route and querying for a setting across applications will be even more painful than several joins.
No way to say what you should compromise on from here. I think I'd go for CommonPreferences first and see where I was at after that.

Conflicting desires in Database Design, with fields of two similar functions

Okay, so I'm making a table right now for "Box Items".
Now, a Box Item, depending on what it's being used for/the status of the item, may end up being related to a "Shipping" box or a "Returns" box.
A Box Item may be defective:if it is, a flag will be set in the Box Item's row (IsDefective), and the Box Item will be put in a "Returns" box (with other items to be returned to that vendor). Otherwise, the Box Item will eventually be put into a "Shipping" box (with other items to be shipped). (Note that Shipping and Returns boxes have their own tables: there's not one common table for all boxes... though maybe I should consider doing that if possible as a third possibility?)
Maybe I'm just not thinking clearly today, but I started questioning what should be done in this situation.
My gut tells me that I should have a separate field for each possible relation, even if only one of the relations can happen at any given time, which would make the schema for Box Items look like:
BoxItemID
Description
IsDefective
ShippingBoxID
ReturnBoxID
etc...
This would make the relations clear, but it seems wasteful (since only one of the relations will be used at any time). So then I thought I could have just one field for the BoxID, and determine which BoxID it's referring to (a Shipping or a Returns Box ID) based on the IsDefective field:
BoxItemID
Description
IsDefective
BoxID
etc...
This seems less wasteful, but doesn't sit right with me. The relation isn't obvious.
So, I put it to you, database gurus of Stackoverflow. What would you do in this situation?
EDIT: Thank you everyone for your input! It's given me a lot to think about. For one, I'm going to use an ORM next time I start a project like this. =) For two, since I'm not right now, I'll bite the four bytes and use two fields.
Thanks everyone again!
I'm with Psychotic Venom and mattlant.
Going the polymorphic route (having to figure out which table your foreign key points to based on the contents of another field) is going to be a pain. Coding the constraints for that maybe tough (I'm not sure most databases would support that natively, I think you'd have to use a trigger).
Do items ever move between the tables? Sticking with two tables with identical definitions where one is for returns and one is for shipping may be the easiest route. If you want to stick with the definition you first proposed (with the two separate fields) is perfectly reasonable.
"Premature optimization is the root of all evil" and all that. While it seems wasteful, remember what you're storing. Since they are IDs they are probably just integers, maybe 4 bytes. Wasting four bytes per record is basically nothing. In fact, due to padding to put things on even addresses or other such things it may be "free" to put that extra field in there. It all depends on the DB design.
Unless you have a very good reason to go the polymorphic route (like you're on an embedded system with little memory or you have to replicate across some really slow 9600bps link) it probably won't be worth the headaches you can end up with. Having to write all those special cases into your queries can get annoying.
Quick example: doing a join between two tables where if you want to join is based on if the isDefective flag is set is going to be a pain. Being able to just use one of the two columns alone is probably enough of a hassle you may save, at least for me.
I would consider making a single table for the boxes and the box type be a column of the box table. This would simplify the relationships and make it easy to still query for box type. So the box item only has one foreign key to the boxId.
I'd use what Hibernate calls Table-per-subclass, so my DB would wind up with 3 tables for Boxes: Box, ShippingBox, and ReturnBox. The FK in BoxItem would point to Box.
What you're talking about is polymorphic relations. A single ID that can reference multiple other tables. There are several frameworks that support this, however, it is (potentially) bad for database integrity (that could be a whole other discussion whether or not your database or your application should maintain referential integrity).
What about this?
BoxItem:
BoxItemID, Description, IsDefective
Box:
BoxID, Description
BoxItemMap:
BoxID, BoxItemID, BoxItemType
Then you can have BoxItemType be an enumeration, or an integer where you define constants in your application as "Return" or "Shipping" as the type of box.
Agree about the polymorphic discussion above, although it has potential to be used poorly, it is still a viable solution.
Basically you have a base table called box. Then you have two other tables, shipping box and return box. Those two add any extra fields that are special to them. they are related to box with a 1:1 fk.Boz base table has the common fields of all box types.
You relate BoxItem with the box table. The way you you get the proper box type is by doing a query that joins the child box with the root box based on the key. The record that has in both the base box and the child box is of that type.
You just have to be careful like mentioned that when you create a box type that it is done correctly. BUt thats what testing is for. The code to add them only needs ot written once. Or use an ORM.
Almost all ORM's support this strategy.
I'd go with just a single BoxItems table with IsDefective, ShippingBoxID, the shipping-box-related fields, ReturnBoxID and the return-box-related fields. Some fields will always be NULL for each record.
This is a very simple and self-evident design that the next developer is unlikely to be confused by. In theory this design is inefficient because of the guaranteed empty fields for each row. In practice, databases tend to have a minimum required storage size for each row anyway, so (unless the number of fields is huge) this design is as efficient as possible anyway, and much easier to code to.
I'd probably go with:
BoxTable:
box_id, box_descrip, box_status_id ...
1, Lovely Box, 1
2, Borked box, 2
3, Ugly Box, 3
4, Flammable Box, 4
BoxStatus:
box_status_id, box_status_name, box_type_id, ....
1,Shippable, 1
2,Return, 2
3,Ugly, 2
4,Dangerous,3
BoxType:
box_type_id, box_type_name, ...
1, Shipping box, ...
2, Return box, ....
3, Hazmat box, ...
That way the Box Status defines the box type, and it's flexible if you need to expand into a few more status levels or box types later on.

Default database IDs; system and user values

As part of our current database work, we are looking at a dealing with the process of updating databases.
A point which has been brought up recurrently, is that of dealing with system vs. user values; in our project user and system vals are stored together. For example...
We have a list of templates.
1, <system template>
2, <system template>
3, <system template>
These are mapped in the app to an enum (1, 2, 3)
Then a user comes in and adds...
4, <user template>
...and...
5, <user template>
Then.. we issue an upgrade.. and insert as part of our upgrade scripts...
<new id> [6], <new system template>
THEN!!... we find a bug in the new system template and need to update it... The problem is how? We cannot update record using ID6 (as we may have inserted it as 9, or 999, so we have to identify the record using some other mechanism)
So, we've come to two possible solutions for this.
In the red corner (speed)....
We simply start user Ids at 5000 (or some other value) and test data at 10000 (or some other value). This would allow us to make modifications to system values and test them up to the lower limit of the next ID range.
Advantage...Quick and easy to implement,
Disadvantage... could run out of values if we don't choose a big enough range!
In the blue corner (scalability)...
We store, system and user data separately, use GUIDs as Ids and merge the two lists using a view.
Advantage...Scalable..No limits w/regard to DB size.
Disadvantage.. More complicated to implement. (many to one updatable views etc.)
I plump squarely for the first option, but looking for some ammo to back me up!
Does anyone have any thoughts on these approaches, or even one(s) that we've missed?
I have never had problems (performance or development - TDD & unit testing included) using GUIDs as the ID for my databases, and I've worked on some pretty big ones. Have a look here, here and here if you want to find out more about using GUIDs (and the potential GOTCHAS involved) as your primary keys - but I can't recommend it highly enough since moving data around safely and DB synchronisation becomes as easy as brushing your teeth in the morning :-)
For your question above, I would either recommend a third column (if possible) that indicates whether or not the template is user or system based, or you can at the very least generate GUIDs for system templates as you insert them and keep a list of those on hand, so that if you need to update the template, you can just target that same GUID in your DEV, UAT and /or PRODUCTION databases without fear of overwriting other templates. The third column would come in handy though for selecting all system or user templates at will, without the need to seperate them into two tables (this is overkill IMHO).
I hope that helps,
Rob G
I recommend using the second with the modification that you store the system and user values in one table. GUID is quite reliable in this manner.
Another idea: use any text-based ID (not necessary GUID), which you give for the system values and is generated by a random string or a string based on some kind of custom logic for the user values.
Another idea: use the first approach, but extend the table with a flag which shows if a value is system or user. Maybe this is the easiest. Ok, you have to write some kind of mechanism to update the correct system value, but it can be done easily.
+1 for Biri's text based ID - define a "template_mnemonic" text based column and make it the primary key. This will be a known value when you insert it as you, the developers will have decided on it (or auto-generated it) and you will always be able to reference a template by its mnemonic regardless of how many user specified templates there are. It also allows users to have a meaningful naming convention for their templates.
Maybe I didn't get it, but couldn't you use GUIDs as Ids and still have user and system data together? Then you can access the system data by the (non-changable) GUIDs.
I don't think that GUID should make any problem.
If you want to avoid it, then use a flag:
ID int
template whatever
flag enum/int/bool
Flag shows whether the actual value is a system or a user value.
If you would like to update a system value, then ask only for system values ordered by ID, and it will show you actual order of insertion (you should have a bigint or something for ID to make sure that it doesn't get full and it doesn't get the deleted IDs back to work). With this list the x. record is the x. inserted system value.
I think there is a better third solution.
It strikes me that you're storing two different things in the same table and that you might be better off creating 2 separate tables one for user templates and one for system templates. You might then be able to create a view over the two tables to make them appear as a single object to your application.
Obviously I don't have full knowledge of your application and this may be impossible for you for any number of reasons but I think it's a neater solution than GUIDs and way safer than ranges of IDs (seriously don't do ID ranges it'll bite you one day)

Resources