Auto-Complete/Primary Key as String - PostgreSQL

I set up a database that is not too complex but nonetheless has multiple many-to-many relationships. Let me briefly explain the database using three tables (there are many more, but I want to keep things simple):
The database stores information about completed projects. One attribute is the software used. So I have three tables (with their respective columns/keys):
tblProjects(ProjectID[PK], ProjectTitle, etc...)
tblProjectsSoftware(SoftwareID[FK], ProjectID[FK], UniqueID[PK])
tblSoftwareUsed(SoftwareID[PK], SoftwareName)
In order to make data entry easier in phpPgAdmin, I was considering just making 'SoftwareName' the primary key in tblSoftwareUsed. This is because when I go to enter the software associated with certain projects into tblProjectsSoftware, I can only use the auto-complete feature on the SoftwareID column, which is more or less a meaningless number.
As you can see, when entering data into the SoftwareID column of tblProjectsSoftware, I would only be able to 'filter' results by the ID and not the name. When this database gets large, that may not be an issue for software, but there are some other attributes that will have tons of records. To explain further: I would start my data entry by creating a record for the project in tblProjects. Then I would create new records (if necessary) for the software used. Then, when entering data into tblProjectsSoftware, I would either have to know the ID of the software or click through a few pages to find it.
So, my question is, would I have any issues making the name of the software my primary key, or would it be better to leave it as is with the ID as the PK? Furthermore, maybe I am missing an option to make 'SoftwareName' searchable in addition to the ID.

There are advantages and disadvantages to using surrogate keys, which are discussed at length in this Wikipedia article:
http://en.wikipedia.org/wiki/Surrogate_key
Borrowing their headers...
Advantages:
Immutability
Requirement changes
Performance
Compatibility
Uniformity
Validation
Disadvantages:
Disassociation
Query optimization
Normalization
Business process modeling
Inadvertent disclosure
Inadvertent assumptions
More often than not, you'll want to use a surrogate key for practical reasons -- such as avoiding headaches when you need to update a software name.
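One option along those lines: keep the surrogate SoftwareID as the PK, but declare SoftwareName NOT NULL UNIQUE so the name is a candidate key in its own right. A minimal sketch (PostgreSQL syntax, reusing the question's table names; the column types are assumptions):
CREATE TABLE tblSoftwareUsed (
    SoftwareID   serial PRIMARY KEY,       -- surrogate key: small, stable over renames
    SoftwareName text   NOT NULL UNIQUE    -- natural candidate key: stays unique and searchable
);
CREATE TABLE tblProjectsSoftware (
    UniqueID   serial  PRIMARY KEY,
    ProjectID  integer NOT NULL REFERENCES tblProjects (ProjectID),
    SoftwareID integer NOT NULL REFERENCES tblSoftwareUsed (SoftwareID),
    UNIQUE (ProjectID, SoftwareID)         -- record each project/software pairing once
);
Joins keep using the small, immutable ID, while the UNIQUE constraint guarantees the name can't be duplicated and gives data-entry tools something meaningful to match on.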

Related

Single Big SQL Server lookup table

I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables, like Language, Countries, States, Status, etc. All these lookup tables have almost identical structures: two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and my primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that have the actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, as the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they will be within a CodeCategory), so I cannot make an FK from just the fact table code field into the big lookup table Code field.
So am I missing something, or is this impossible to do and still be able to have FKs in the related tables? I wish I could have an FK where one of the columns I was matching to in the lookup table would match a string constant. Like this (I know this is impossible, but it gives you an idea of what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
    FOREIGN KEY ('Language', [LanguageCode])
    REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs, so if nothing like this is possible, then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single-field PK or that it be auto-generated. It requires a composite PK composed of ([AppCodeCategory], [AppCode]), and then BOTH fields need to be present in the fact table, which would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
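For reference, a rough sketch of that composite PK/FK pair (T-SQL; the column types, and the DEFAULT/CHECK trick that pins the category to a constant, are assumptions rather than anything from the question):
CREATE TABLE [dbo].[AppCodes]
(
    [AppCodeCategory] VARCHAR(20)  NOT NULL,
    [AppCode]         VARCHAR(20)  NOT NULL,
    [Decode]          VARCHAR(100) NOT NULL,
    CONSTRAINT [PK_AppCodes] PRIMARY KEY ([AppCodeCategory], [AppCode])
);
CREATE TABLE [dbo].[Users]
(
    [UserID]               INT IDENTITY(1, 1) PRIMARY KEY,
    -- the category must physically travel with every FK; pin it to a constant:
    [LanguageCodeCategory] VARCHAR(20) NOT NULL
        CONSTRAINT [DF_Users_LanguageCategory] DEFAULT ('Language'),
    [LanguageCode]         VARCHAR(20) NOT NULL,
    CONSTRAINT [CK_Users_LanguageCategory]
        CHECK ([LanguageCodeCategory] = 'Language'),
    CONSTRAINT [FK_User_AppCodes]
        FOREIGN KEY ([LanguageCodeCategory], [LanguageCode])
        REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
);
The DEFAULT plus CHECK pair emulates the "string constant" column the question wished for, at the cost of storing the category in every row -- which is exactly the overhead discussed below.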
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statii, etc. are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either disallows adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which category/ies they apply to (have fun debugging issues related to that, and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3-digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1-byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach may look "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? If you are going to spend hours making changes to the system, it is reasonable to ask that there be a stated benefit for that time. This change certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#; if not, translate to whatever is appropriate) by setting the enum values explicitly, with a pattern where the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assuming the "Country" category == 1 and the "Language" category == 2, you could do:
enum AppCodes
{
    // Countries
    UnitedStates = 1000001,
    Canada = 1000002,
    SomewhereElse = 1000003,
    // Languages
    EnglishUS = 2000001,
    EnglishUK = 2000002,
    French = 2000003
};
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.

Database Child Table with Two Possible Parents

First off, I'm not sure how exactly to search for this, so if it's a duplicate please excuse me. And I'm not even sure if it'd be better suited to one of the other StackExchange sites; if so, please let me know and I'll ask over there instead. Anyways...
Quick Overview of the Project
I'm working on a hobby project -- a writer's notebook of sorts -- to practice programming and database design. The basic structure is fairly simple: the user can create notebooks, and under each notebook they can create projects associated with that notebook. Maybe the notebook is for a series of short stories, and each project is for an individual story.
They can then add items (scenes, characters, etc.) to either a specific project within the notebook, or to the notebook itself so that it's not associated with a particular project. This way, they can have scenes or locations that span multiple projects, as well as having some that are specific to a particular project.
The Problem
I'm trying to keep a good amount of the logic within the database -- especially within the table structure and constraints, if at all possible. The basic structure I have for a lot of the items looks like this (I'm using MySQL, but this is a pretty generic problem -- I just mention it for the syntax):
CREATE TABLE SCENES(
    ID BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY NOT NULL,
    NOTEBOOK BIGINT UNSIGNED NULL,
    PROJECT BIGINT UNSIGNED NULL,
    ....
);
The problem is that I need to ensure that at least one of the two references, NOTEBOOK and/or PROJECT, is set. They don't both have to be set -- PROJECT has a reference to the NOTEBOOK it's in. I know I could just have a generic "Parent Id" field, but I don't believe it'd be possible to have a foreign key to two tables, right? There's also the possibility of adding additional cross-reference tables -- i.e. SCENES_X_NOTEBOOKS and SCENES_X_PROJECTS -- but that'd get out of hand pretty quickly, since I'd have to add similar tables for each of the different item types I'm working with. That would also introduce the problem of ensuring each item has an entry in the cross-reference tables.
It'd be easy to put this kind of logic in a stored procedure or the application logic, but I'd really like to keep it in a constraint of some kind if at all possible, just to eliminate any possibility that the logic gets bypassed somehow.
Any thoughts? I'm interested in pretty much anything -- even if it involves a redesign of the tables or something.
The thing about scenes and characters is that a writer might drop them from their current project. When that happens, you don't want to lose the scenes and characters, because the writer might decide to use them years later.
I think the simplest approach is to redefine this:
They can then add items (scenes, characters, etc.) to either a
specific project within the notebook, or to the notebook itself so
that it's not associated with a particular project.
Instead of that, think about saying this.
They can then add items (scenes, characters, etc.) to either a
user-defined project within the notebook, or to the system-defined
project named "Unassigned". The project "Unassigned" is for things not
currently assigned to a user-defined project.
If you do that, then scenes and characters will always be assigned to a project--either to a user-defined project, or to the system-defined project named "Unassigned".
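A minimal sketch of that approach (MySQL syntax; the NOTEBOOKS table and all names here are illustrative assumptions): PROJECT becomes NOT NULL, and each notebook gets its catch-all project seeded at creation time.
CREATE TABLE PROJECTS(
    ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    NOTEBOOK BIGINT UNSIGNED NOT NULL,
    NAME VARCHAR(100) NOT NULL,
    FOREIGN KEY (NOTEBOOK) REFERENCES NOTEBOOKS(ID)
);
CREATE TABLE SCENES(
    ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    PROJECT BIGINT UNSIGNED NOT NULL,  -- always set; no NULLable parent columns
    FOREIGN KEY (PROJECT) REFERENCES PROJECTS(ID)
);
-- Whenever a notebook is created, create its system project alongside it:
INSERT INTO PROJECTS (NOTEBOOK, NAME) VALUES (1, 'Unassigned');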
I'm unclear as to what exactly your requirements are, but let me at least try to answer some of your individual questions...
The problem is that I need to ensure that at least one of the two references, NOTEBOOK and/or PROJECT, are set.
CHECK (NOTEBOOK IS NOT NULL OR PROJECT IS NOT NULL)
I don't believe it'd be possible to have a foreign key to two tables, right?
Theoretically, you can reference two tables from the same field, but this would mean both of these tables would have to contain the matching row. This is probably not what you want.
You are on the right track here - let the NOTEBOOK be the child endpoint of a FK towards one table and the PROJECT towards the other. A NULL foreign key will not be enforced, so you don't have to set both of them.
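Putting those two pieces together, a minimal sketch (MySQL syntax, reusing the question's column names; the NOTEBOOKS and PROJECTS tables are assumed). Note that MySQL parses but does not enforce CHECK constraints before version 8.0.16:
CREATE TABLE SCENES(
    ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    NOTEBOOK BIGINT UNSIGNED NULL,
    PROJECT BIGINT UNSIGNED NULL,
    FOREIGN KEY (NOTEBOOK) REFERENCES NOTEBOOKS(ID),
    FOREIGN KEY (PROJECT) REFERENCES PROJECTS(ID),
    -- at least one parent must be present; a NULL FK is simply not enforced
    CHECK (NOTEBOOK IS NOT NULL OR PROJECT IS NOT NULL)
);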
There's also the possibility of adding additional cross-reference tables -- i.e. SCENES_X_NOTEBOOKS and SCENES_X_PROJECTS -- but that'd get out of hand pretty quickly, since I'd have to add similar tables for each of the different item types I'm working with.
If you are talking about junction (a.k.a. link) tables that model many-to-many relationships, then yes -- you'd have to add them for each pair of tables engaged in such a relationship.
You could, however, minimize the number of such table pairs by using inheritance (a.k.a. category, subclassing, subtype, generalization hierarchy...). Imagine you have a set of M tables that have to be connected to a second set of N tables. Normally, you'd have to create M*N junction tables. But if you inherit all tables in the first set from a common parent table, and do the same for the second set, you can now connect them all through just one junction table between these two parent tables.
The full discussion on inheritance is beyond the scope here, but you might want to look at the "ERwin Methods Guide", chapter "Subtype Relationships", as well as this post.
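A minimal sketch of the idea (MySQL syntax; every name here is illustrative): one parent table per set, subtype tables whose PK doubles as an FK to the parent, and a single junction table between the two parents.
-- Supertype of SCENES, CHARACTERS, LOCATIONS, ...
CREATE TABLE ITEMS(
    ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ITEM_TYPE VARCHAR(20) NOT NULL  -- discriminator: 'SCENE', 'CHARACTER', ...
);
-- Subtype: its PK is also an FK back to the parent row.
CREATE TABLE SCENES(
    ID BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    FOREIGN KEY (ID) REFERENCES ITEMS(ID)
);
-- Supertype of NOTEBOOKS and PROJECTS.
CREATE TABLE CONTAINERS(
    ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
-- One junction table instead of M*N of them.
CREATE TABLE ITEMS_X_CONTAINERS(
    ITEM BIGINT UNSIGNED NOT NULL,
    CONTAINER BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (ITEM, CONTAINER),
    FOREIGN KEY (ITEM) REFERENCES ITEMS(ID),
    FOREIGN KEY (CONTAINER) REFERENCES CONTAINERS(ID)
);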
It'd be easy to put this kind of logic in a stored procedure or the application logic, but I'd really like to keep it in a constraint of some kind if at all possible, just to eliminate any possibility that the logic got bypassed some how.
Your instincts are correct -- make the database "defend" itself from bad data as much as possible. Here is the order of preference for ensuring the correctness of your data:
The structure of tables themselves.
The declarative database constraints (integrity of domain, integrity of key and referential integrity).
Triggers and stored procedures.
Middle tier.
Client.
For example, if you can ensure a certain logic must be followed just by using declarative database constraints, don't put it in triggers.

General database design: Is it ever considered "okay" to create a non-normalized table on purpose?

After-edit: Wow, this question got long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to you and your best judgement to do what you think is best.
I came up with a workable method to normalize these options down to
numeric identifiers by creating a primitives lookup table that stores
values such as Monthly, Semi-Annually, and Yearly. I then store the
IDs of these primitives in the table of record and use a view to join
that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
This means you could have hundreds of records with the string
'Monthly' stored in the table's ReviewPeriod column. The thought of
this happening has made me cringe since I've started working here, but
now I am starting to think that non-normalized data may be the best
option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method
while allowing it to be dynamic and support the constant adding of new
options to any dropdown list at any time is this: When saving the data
to the database, iterate through every single property of my business
object (.NET class in this case) and check for any string value that
exists in the primitives table. If it doesn't, add it and return the
auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
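To make the contrast concrete, here is a minimal sketch (generic SQL; all names are illustrative) of a lookup table that genuinely constrains its referencing column -- precisely because its contents are curated rather than auto-inserted:
CREATE TABLE review_periods (
    review_period VARCHAR(20) PRIMARY KEY
);
INSERT INTO review_periods VALUES
    ('Monthly'), ('Semi-Annually'), ('Yearly'), ('NA'), ('Manual');
CREATE TABLE records_of_interest (
    record_id     INT PRIMARY KEY,
    review_period VARCHAR(20) NOT NULL,
    -- the FK only means something if the set of valid values is closed
    FOREIGN KEY (review_period) REFERENCES review_periods (review_period)
);
If the application auto-inserts whatever string it encounters into review_periods before saving, the constraint no longer rejects anything -- which is the point made above.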
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.

Is it insecure to reveal a row's primary key to the user?

Why do many applications replace the primary key of a database with a seemingly random alternative id when revealing the record to the user?
My guess is that it prevents users from guessing other rows in the table. If so, isn't that just a false sense of security?
I guess you are talking about surrogate keys here. One of the desired or supposed advantages of surrogate keys is that they aren't burdened by any external meaning or dependency on anything outside the database. So for example the surrogate key values can safely be reassigned or the key can be refactored or discarded without any consequences for users of the system.
Generally surrogate keys are kept hidden from users so that they don't acquire any such external dependencies. Being hidden from users was in fact part of the original definition of a surrogate key as proposed by E. F. Codd. If key values reside in the user's browser cache or favourites list then they aren't much use as "surrogates" any more. So that's one common reason why you will see one key used only inside the database and a different key for the same table made visible in the application.
I think it may depend on the type of application you are working with. I work with enterprise software that is only used by the company I work for and is not generally available to the outside world. In this case, it is often critical to let the user see the surrogate key for people-related records, because the information in the person table has no uniqueness. There can be two John Smiths (we actually have over 1000 of them) who are genuinely different people. They may even have the same business address and be different people (sons are often named for fathers and work in the same medical practice, for instance). So users need to refer to the surrogate key on forms and in reporting to ensure they are using the record they thought they wanted. Otherwise, if they wanted to research further details about the John Smith they saw in a report, how would they look him up in the application without having to go through all 1000 to find the right one? Creating a fake id as well as the real one would be time-consuming (we import millions of records at a time) and offer no real gain, since the data is not visible outside our company application.
For a web app that is open to the general public, I can see where you might not want to show this information.

What exactly does database normalization do?

Supposedly normalization reduces redundancy of data and increases performance. What is the reason for dividing the master table into other small tables, applying relationships between them, retrieving the data using all possible unions, subqueries, joins etc.? Why can't we have all the data in a single table and retrieve it as required?
The main reason is to eliminate repetition of data. For example, if you had a user with multiple addresses and you stored this information in a single table, the user information would be duplicated along with each address entry. Normalisation would separate the addresses into their own table and then link the two using keys. This way you wouldn't need to duplicate the user data, and your db structure becomes a little cleaner.
Full normalisation will generally not improve performance; in fact it can often make it worse, but it will keep your data duplicate-free. In some special cases I've even denormalised specific data in order to get a performance increase.
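A minimal sketch of that user-and-addresses example (generic SQL; names are illustrative):
-- User data lives in one place; each address row points back at its
-- user instead of repeating the user's columns.
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    email   VARCHAR(100) NOT NULL
);
CREATE TABLE addresses (
    address_id INT PRIMARY KEY,
    user_id    INT NOT NULL,
    street     VARCHAR(200) NOT NULL,
    city       VARCHAR(100) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);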
Normalization comes from the mathematical concept of being "normal." Another word would be "perpendicular." Imagine a regular two-axis coordinate system. Moving up just changes the y coordinate, moving to the side just changes the x coordinate. So every movement can be broken down into a sideways and an up-down movement. These two are independent of each other.
Normalization in databases essentially means the same thing: if you change a piece of data, this should change just one single piece of information in the database. Imagine a database of e-mails: if you store the ID and the name of the recipient in the Mails table, but the Users table also associates the name with the ID, then if you change a user name you not only have to change it in the Users table, but also in every single message that the user is involved with. So the axis "message" and the axis "user" are not "perpendicular" or "normal".
If on the other hand, the Mails table only has the user ID, any change to the user name will automatically apply to all the messages, because on retrieval of a message, all user information is gathered from the Users table (by means of a join).
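As a sketch of that retrieval (generic SQL; table and column names are illustrative):
-- The recipient's name is stored once, in users, and joined in on read.
SELECT m.subject,
       u.name AS recipient_name
FROM   mails m
JOIN   users u ON u.user_id = m.recipient_id;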
Database normalisation is, at its simplest, a way to minimise data redundancy. To achieve that, certain forms of normalisation exist.
First normal form can be summarised as:
no repeating groups in single tables.
separate tables for related information.
all items in a table related to the primary key.
Second normal form adds another restriction, basically that every column not part of a candidate key must be dependent on every candidate key (a candidate key being defined as a minimal set of columns which cannot be duplicated in the table).
And third normal form goes a little further, in that every column not part of a candidate key must not be dependent on any other non-candidate-key column. In other words, it can depend only on the candidate keys. This leads to the saying that 3NF depends on the key, the whole key and nothing but the key, so help me Codd¹.
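As a small worked example of that rule (generic SQL; names are illustrative): storing a department name on the employee row makes it depend on department_id, a non-key column -- a transitive dependency. 3NF splits it out:
-- Violates 3NF: department_name depends on department_id, not on the key.
--   employees(employee_id PK, department_id, department_name)
-- 3NF decomposition: every non-key column depends on the key alone.
CREATE TABLE departments (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL
);
CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    department_id INT NOT NULL,
    FOREIGN KEY (department_id) REFERENCES departments (department_id)
);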
Note that the above explanations are tailored toward your question rather than database theorists, so the descriptions are necessarily simplified (and I've used phrases like "summarised as" and "basically").
The field of database theory is a complex one and, if you truly wish to understand it, you'll eventually have to get to the science behind it. But, in terms of your question, hopefully this will be adequate.
Normalization is a valuable tool in ensuring we don't have redundant data (which becomes a real problem if the two redundant areas get out of sync). It doesn't generally increase performance.
In fact, although all databases should start in 3NF, it's sometimes acceptable to drop to 2NF for performance gains, provided you're aware of, and mitigate, the potential problems.
And be aware that there are also "higher" levels of normalisation such as (obviously) fourth, fifth and sixth, but also Boyce-Codd and some others I can't remember off the top of my head. In the vast majority of cases, 3NF should be more than enough.
¹ If you don't know who Edgar Codd (or Christopher Date, for that matter) is, you should probably research them; they're the fathers of relational database theory.
We use normalization to reduce the chances of anomalies that may arise as a result of data insertion, deletion, and update. Normalization doesn't necessarily increase performance.
There is much material on the internet, so I won't repeat the stuff here. But you can have a look at:
Normalization rules
Anomalies
(others as well)
As well as all the above, it just makes a certain sense. Say you have a user and you want to record what kind of car they have.
Put that all in one table and then you're fine, until someone owns two cars... You're then going to need two rows for that person, and a way of making sure that you can link those two rows together...
And then what if you also want to record how many dogs they have? Same table with lots of confusing dups? Another table with your own custom logic to manage unique users?
Normalization keeps you away from a lot of these problems...
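A minimal sketch of where that lands (generic SQL; names are illustrative): one row per person, plus one child table per repeating fact, so a second car or a fifth dog never forces duplicate person rows.
CREATE TABLE persons (
    person_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);
CREATE TABLE cars (
    car_id    INT PRIMARY KEY,
    person_id INT NOT NULL,
    model     VARCHAR(100) NOT NULL,
    FOREIGN KEY (person_id) REFERENCES persons (person_id)
);
CREATE TABLE dogs (
    dog_id    INT PRIMARY KEY,
    person_id INT NOT NULL,
    name      VARCHAR(100) NOT NULL,
    FOREIGN KEY (person_id) REFERENCES persons (person_id)
);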
