DB Schema Organization


I'm currently in the planning phase of building a scheduling web app (for volunteer staffing of events), and I've got a question for those with more experience.
Background:
There's a calendar of events, and any user at any time can register for any of the events. At a later time, but before that event, one of the admins will step in and select a "Staff List" out of those that registered, and the rest will be put into an "Alternate List".
What I've been thinking so far is that there will be an Event table, a User table, and then three others:
UserEvent
Maps users to the events they registered for. Does not imply membership in either the Staff or the Alt list.
UserStaff
Maps users to the events they registered for and have been selected to staff.
UserAlt
Similar to UserStaff
The question then becomes two part:
Is this a good way to do it?
Should each of those three associative tables have the user id and the event id?
That second question is really the one I'd like to see discussed. That seems like a lot of duplicated material (everything in either UserStaff or UserAlt will always be in UserEvent), so I was thinking of creating a unique key for the UserEvent table, in addition to the composite key, that the other tables (UserStaff and UserAlt) will refer to. On the plus side, there is less duplicated content, on the down side there's an intermediary table (UserEvent) that needs to be referenced in almost every query this way.
Hopefully I've been clear enough, and thanks in advance.

I would have the following tables:
User (UserID, firstname, lastname, etc.)
Event (EventID, Name, Date, Location, Capacity, etc.)
EventRegistration (EventRegistrationID, UserID, EventID, ParticipantTypeID, etc.)
ParticipantType (ParticipantTypeID, Name)
ParticipantType.Name is one of "participant" or "staff".
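A minimal sketch of that layout, using Python's built-in sqlite3 module so it can be run as-is. The table and column names come from the list above; the sample users, the event, and the UNIQUE (UserID, EventID) constraint are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE User (UserID INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
CREATE TABLE Event (EventID INTEGER PRIMARY KEY, Name TEXT, Date TEXT);
CREATE TABLE ParticipantType (
    ParticipantTypeID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL UNIQUE
);
CREATE TABLE EventRegistration (
    EventRegistrationID INTEGER PRIMARY KEY,
    UserID INTEGER NOT NULL REFERENCES User (UserID),
    EventID INTEGER NOT NULL REFERENCES Event (EventID),
    ParticipantTypeID INTEGER NOT NULL
        REFERENCES ParticipantType (ParticipantTypeID),
    UNIQUE (UserID, EventID)  -- assumption: one registration per user per event
);
INSERT INTO ParticipantType VALUES (1, 'participant'), (2, 'staff');
-- Made-up sample data
INSERT INTO User VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray');
INSERT INTO Event VALUES (10, 'Food drive', '2024-06-01');
INSERT INTO EventRegistration (UserID, EventID, ParticipantTypeID)
    VALUES (1, 10, 1), (2, 10, 1);
""")

# An admin later selects Ann as staff: one UPDATE, no row moves between tables.
conn.execute(
    "UPDATE EventRegistration SET ParticipantTypeID = 2 "
    "WHERE UserID = 1 AND EventID = 10"
)

def names_by_type(event_id, type_name):
    """First names of users registered for an event with the given type."""
    rows = conn.execute(
        """SELECT u.firstname
           FROM EventRegistration r
           JOIN User u ON u.UserID = r.UserID
           JOIN ParticipantType t ON t.ParticipantTypeID = r.ParticipantTypeID
           WHERE r.EventID = ? AND t.Name = ?
           ORDER BY u.firstname""",
        (event_id, type_name),
    ).fetchall()
    return [row[0] for row in rows]
```

An "alternate" list would just be one more row in ParticipantType rather than another table.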

This seems good, although you might want to consider combining your User - Event association tables into one, and having a column on that table that indicates the purpose of the association, i.e. Event, Staff, or Alt. This would effectively obviate the need for the duplication you describe in the UserEvent tables, since a Staff or Alt association already implies the plain Event registration for most purposes.
One benefit of this approach is that it allows for there to be multiple types of User - Event associations, such as if you have a User who is a Staffer for an Event but not a Participant, or a User who is just an Alt; this approach saves you from having to enumerate all the possible combinations. Now, if your design explicitly specifies that you can only have a certain set of User Participation types, this might introduce a level of dissociation you don't want; you may prefer to have explicit constraints on the set of participation levels that a User may have on an Event. If you don't have that tightly specified set, on the other hand, this system allows for adding more Participation roles easily (and without disturbing existing Participation roles).

Not a direct answer to your question, but here's a site I like. It's got tons (and tons) of sample schema. I generally don't use it as definitive (of course), but sometimes it will give me an idea on something I wasn't thinking of.

Related

Is there a pattern to avoid ever-multiplying link tables in database design?

Currently scoping out a new system. Like many systems, it will be required to store documents and link them to other kinds of item. In this instance a Document object can belong to a Job or it can belong to an Item (which in turn belongs to a Job).
We could do this by having a JobId and an ItemId against a Document and leaving one or the other blank if necessary, but that's going to mean annoying conditional logic in the handling code. So, two link tables seems a better idea.
However, it is likely that we will need to link Documents to other items in the system at some point in the future. There are Company and User objects, for example, and we might want to record Documents against those. There may be more.
That would entail a proliferation of link tables which, while effective, is messy and hard to follow.
This solution is in SQL Server and will be handled in code via Entity Framework.
Are there any design principles that can allow us to hook up Document objects with a variety of other system objects as required in a neater and more flexible way?

You could store two values: the id, and the type of object to which the document is attached. It doesn't allow the use of foreign keys, but is compatible with many application development frameworks.
If you have the partitioning option then you could dedicate different partitions to different object types.
You could also have multiple tables, one for job documents, one for item documents, and get an overview of all of them with a view that UNION ALL's them together. If you need uniqueness in that result set then you could use UUIDs for the primary key, or add an extra column to the view to express from which table the row was read.
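A rough sketch of the last option, one table per document type plus a UNION ALL view with an extra column naming the source table. It uses Python's sqlite3 only to stay runnable (the question targets SQL Server, where the view syntax is similar); the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_document (
    id INTEGER PRIMARY KEY, job_id INTEGER NOT NULL, title TEXT
);
CREATE TABLE item_document (
    id INTEGER PRIMARY KEY, item_id INTEGER NOT NULL, title TEXT
);
-- 'source' says which table a row was read from, so (source, id) is unique
-- in the view even though the two tables reuse the same id values.
CREATE VIEW all_documents AS
    SELECT 'job' AS source, id, job_id AS owner_id, title FROM job_document
    UNION ALL
    SELECT 'item' AS source, id, item_id AS owner_id, title FROM item_document;
INSERT INTO job_document VALUES (1, 100, 'Contract');
INSERT INTO item_document VALUES (1, 200, 'Spec sheet');
""")

all_docs = conn.execute(
    "SELECT source, id, owner_id, title FROM all_documents ORDER BY source"
).fetchall()
```

Each underlying table can still carry a real foreign key to its owner; linking Documents to Company or User later means adding one table and one more branch in the view.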

Database design - system default items and custom user items

This question applies to any database table design, where you would have system default items and custom user defaults of the same type (ie user can add his own custom items/settings).
Here is an example using invoicing and payment terms. By default an invoice can have payment terms of DueOnReceipt, NET10, NET15 or NET30 (these are the defaults for all users!), so you would have two tables, "INVOICE" and "PAYMENT_TERM":
INVOICE
Id
...
PaymentTermId
PAYMENT_TERM (System default)
Id
Name
Now what is the best way to allow a user to store their own custom "PaymentTerms" and why? (ie user can use system default payment terms OR user's own custom payment terms that he created/added)
Option 1) Add UserId to PaymentTerm, set userid for the user that has added the custom item and system default userid set to null.
INVOICE
Id
...
PaymentTermId
PaymentTerm
Id
Name
UserId (System Default, UserId=null)
Option 2) Add a flag to Invoice "IsPaymentTermCustom" and Create a custom table "PAYMENT_TERM_CUSTOM"
INVOICE
Id
...
PaymentTermId
PaymentTermCustomId
IsPaymentTermCustom (True for custom, otherwise false for system default)
PaymentTerm
Id
Name
PAYMENT_TERM_CUSTOM
Id
Name
UserId
Then check via SQL query whether the user is using a custom payment term: if IsPaymentTermCustom=True the invoice uses a custom payment term, otherwise it uses a system default.
Option 3) ????
...

As a general rule:
Prefer adding columns to adding tables
Prefer adding rows to adding columns
Generally speaking, the considerations are:
Effects of adding a table
Requires the most changes to the app: You're supporting a new kind of "thing"
Requires more complicated SQL: You'll have to join to it somehow
May require changes to other tables to add a foreign key column referencing the new table
Impacts performance because more I/O is needed to join to and read from the new table
Note that I am not saying "never add tables". Just know the costs.
Effects of adding a column
Can be expensive to add a column if the table is large (it can take hours for the ALTER TABLE ADD COLUMN to complete, and during this time the table will be locked, effectively bringing your site "down"), but this is a one-time thing
The cost to the project is low: Easy to code/maintain
Usually requires minimal changes to the app - it's a new aspect of a thing, rather than a new thing
Will perform with negligible performance difference. Will not be measurably worse, but may be a lot faster depending on the situation (if having the new column avoids joining or expensive calculations).
Effects of adding rows
Zero: If your data model can handle your new business idea by just adding more rows, that's the best option
(Pedants kindly refrain from making comments such as "there is no such thing as 'zero' impact", or "but there will still be more disk used for more rows" etc - I'm talking about material impact to the DB/project/code)
To answer the question: Option 1 is best (i.e. add a column to the payment option table).
The reasoning is based on the guidelines above and this situation is a good fit for those guidelines.
Further,
I would also store "standard" payment options in the same table, but with a NULL userid; that way you only have to add new payment options when you really have one, rather than for every customer even if they use a standard one.
It also means your invoice table does not need changing, which is a good thing - it means minimal impact to that part of your app.
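To make Option 1 concrete, here is a small sketch using Python's sqlite3. The table layout follows the question; the custom term 'NET45' and the user ids 7 and 8 are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment_term (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    user_id INTEGER            -- NULL marks a system default
);
CREATE TABLE invoice (
    id INTEGER PRIMARY KEY,
    payment_term_id INTEGER NOT NULL REFERENCES payment_term (id)
);
INSERT INTO payment_term (id, name, user_id) VALUES
    (1, 'DueOnReceipt', NULL),
    (2, 'NET10', NULL),
    (3, 'NET15', NULL),
    (4, 'NET30', NULL),
    (5, 'NET45', 7);           -- hypothetical custom term owned by user 7
""")

def terms_for_user(user_id):
    """System defaults plus the given user's own custom terms."""
    rows = conn.execute(
        "SELECT name FROM payment_term "
        "WHERE user_id IS NULL OR user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The invoice table keeps its single payment_term_id column, so invoice queries never need to know whether a term is a default or a custom one.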

It seems to me that there are merely "Payment Terms" and "Users". The decision of what are the "Default" payment terms is a business rule, and therefore would be best represented in the business layer of your application.
Assuming that you would like to have a set of pre-defined "default" payment terms present in your application from the start, these would already be present in the payment terms table. However, I would put a reference table in between USERS and PAYMENT TERMS:
USERS:
user_id
user_name
USER_PAYMENT_TERMS:
user_id
payment_term_id
PAYMENT_TERMS:
payment_term_id
payment_term
Your business layer should offer up to the user (or more likely, the administrator) through a GUI the ability to:
Assign 0 to many payment term options to a particular user (some users may not want one of the defaults to even be available, for example).
Add custom payment terms, which then become available for assignment to one or more users (this avoids the creation of duplicate payment terms by different users).
Assign a custom payment term to more than one user. Say the user's company has a unique payment process which requires all of their users to use a payment term other than one of the defaults: create the custom term once, and assign it to all users.
Your application business layer would establish rules governing access to payment terms, which could then be accessed by your user interface.
Your UI would then (again, likely through an administrator function) allow the set up of one or more payment terms in addition to the standards you describe, and then make them available to one or more users through something like a checked list box (for example).
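The three tables above might look like the following sketch (Python's sqlite3, with made-up users and terms). The composite primary key on the link table is an assumption; it keeps the same term from being assigned to the same user twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    user_name TEXT NOT NULL
);
CREATE TABLE payment_terms (
    payment_term_id INTEGER PRIMARY KEY,
    payment_term TEXT NOT NULL UNIQUE   -- one row per distinct term
);
CREATE TABLE user_payment_terms (
    user_id INTEGER NOT NULL REFERENCES users (user_id),
    payment_term_id INTEGER NOT NULL
        REFERENCES payment_terms (payment_term_id),
    PRIMARY KEY (user_id, payment_term_id)
);
-- Made-up sample data
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO payment_terms VALUES (1, 'NET10'), (2, 'NET30'), (3, 'NET60');
-- alice gets everything; bob does not want NET10 offered at all.
INSERT INTO user_payment_terms VALUES (1, 1), (1, 2), (1, 3), (2, 2), (2, 3);
""")

def assigned_terms(user_id):
    """Payment terms an administrator has made available to this user."""
    rows = conn.execute(
        """SELECT t.payment_term
           FROM user_payment_terms ut
           JOIN payment_terms t ON t.payment_term_id = ut.payment_term_id
           WHERE ut.user_id = ?
           ORDER BY t.payment_term_id""",
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Here the custom term 'NET60' is stored once and shared by both users, which is the point of putting the link table in the middle.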

Option 1 is definitely better for the following reasons:
Correctness
You can implement a database constraint for uniqueness of the payment term name
You can implement a foreign key constraint from Invoice to PaymentTerm
Ease of Use
Writing queries will be much simpler because you will always join from Invoice to PaymentTerm rather than requiring a more complex join. Most of the time when you select you will not care whether it is a built-in or custom payment term. The optimizer will have an easier time with a normal join than with one that depends on another column to decide which table to join.
Easier to display a list of PaymentTerms coming from one table
We use Option 1 in our data model quite a lot.

Part of the problem, as I see it, is that different payment terms lead to different calculations, too. If I were still in the welding supply business, I'd want to add "2% 10 NET 30", which would mean "2% discount if the payment is made in full within 10 days; otherwise, net 30."
Setting that issue aside, I think ownership of the payment terms makes sense. Assume that the table of users (not shown) includes the user "system" as, say, user_id 0.
create table payment_terms (
payment_term_id integer primary key,
payment_term_owner_id integer not null references users (user_id),
payment_term_desc varchar(30) not null unique
);
insert into payment_terms values (1, 0, 'Net 10');
insert into payment_terms values (2, 0, 'Net 15');
...
insert into payment_terms values (5, 1, '2% 10, Net 30');
This keeps foreign keys simple, and it makes it easy to select payment terms at run time for presentation in the user interface.
Be very careful here. You probably want to store the description, not the ID number, with your invoices. (It's unique; you can set a foreign key reference to it.) If you store only the ID number, updating a user's custom description might subtly corrupt all the data that references it.
For example, let's say that the user created a custom payment term number 5, '2% 10, Net 30'. You store the ID number 5 in your table of invoices. Then the user decides that things will be different starting today, and updates that description to '2% 10, Net 20'. Now on all your past invoices, the arithmetic no longer matches the payment terms.
Your auditor will kill you. Twice.
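Here is a sketch of what the description-as-foreign-key idea buys you, using Python's sqlite3. It assumes the foreign key uses the default NO ACTION behaviour (no ON UPDATE CASCADE); the invoices table and invoice number 1001 are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY);
CREATE TABLE payment_terms (
    payment_term_id INTEGER PRIMARY KEY,
    payment_term_owner_id INTEGER NOT NULL REFERENCES users (user_id),
    payment_term_desc VARCHAR(30) NOT NULL UNIQUE
);
-- Hypothetical invoices table that stores the description, not the ID.
CREATE TABLE invoices (
    invoice_id INTEGER PRIMARY KEY,
    payment_term_desc VARCHAR(30) NOT NULL
        REFERENCES payment_terms (payment_term_desc)
);
INSERT INTO users VALUES (0), (1);
INSERT INTO payment_terms VALUES (5, 1, '2% 10, Net 30');
INSERT INTO invoices VALUES (1001, '2% 10, Net 30');
""")

# Rewriting a description that past invoices point at is rejected.
try:
    conn.execute(
        "UPDATE payment_terms SET payment_term_desc = '2% 10, Net 20' "
        "WHERE payment_term_id = 5"
    )
    update_blocked = False
except sqlite3.IntegrityError:
    update_blocked = True

# The new terms go in as a new row instead; old invoices keep the old text.
conn.execute("INSERT INTO payment_terms VALUES (6, 1, '2% 10, Net 20')")
old_invoice_terms = conn.execute(
    "SELECT payment_term_desc FROM invoices WHERE invoice_id = 1001"
).fetchone()[0]
```

With only the ID stored, the same UPDATE would succeed silently and every past invoice would appear to have been issued under the new terms.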
You'll want to prevent ordinary users from deleting rows owned by the system user. There are several ways to do that.
Use a BEFORE DELETE trigger.
Add another table with foreign key references to the rows owned by the system user.
Restrict all access through stored procedures that prevent deleting system rows.
(And flags are almost never the best idea.)
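As an illustration of the first approach, here is a BEFORE DELETE trigger in SQLite syntax, run through Python's sqlite3. Trigger syntax differs between database systems, so treat this as a sketch; the trigger name and error message are made up, and the users table is left out to keep it short. Note that this trigger blocks everyone, so an administrator would need to drop or bypass it to remove a system row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment_terms (
    payment_term_id INTEGER PRIMARY KEY,
    payment_term_owner_id INTEGER NOT NULL,   -- 0 is the "system" user
    payment_term_desc VARCHAR(30) NOT NULL UNIQUE
);
CREATE TRIGGER protect_system_terms
BEFORE DELETE ON payment_terms
FOR EACH ROW WHEN OLD.payment_term_owner_id = 0
BEGIN
    SELECT RAISE(ABORT, 'system payment terms cannot be deleted');
END;
INSERT INTO payment_terms VALUES (1, 0, 'Net 10');
INSERT INTO payment_terms VALUES (5, 1, '2% 10, Net 30');
""")

# Deleting a system-owned row is refused by the trigger.
try:
    conn.execute("DELETE FROM payment_terms WHERE payment_term_id = 1")
    system_delete_blocked = False
except sqlite3.DatabaseError:
    system_delete_blocked = True

# Deleting a user-owned row still works.
conn.execute("DELETE FROM payment_terms WHERE payment_term_id = 5")
remaining = [
    row[0]
    for row in conn.execute("SELECT payment_term_desc FROM payment_terms")
]
```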

Applying general rules of database design to the problem at hand:
one table for system payment terms
one table for user payment terms
a view that combines the two above (a union of their rows)
Now you can join invoice on the view of payment terms.
Benefits:
No flag columns
No nulls
You separate system defaults from user data
Things become straightforward for the db

Is it insecure to reveal a row's primary key to the user?

Why do many applications replace the primary key of a database with a seemingly random alternative id when revealing the record to the user?
My guess is that it prevents users from guessing other rows in the table. If so, isn't that just a false sense of security?

I guess you are talking about surrogate keys here. One of the desired or supposed advantages of surrogate keys is that they aren't burdened by any external meaning or dependency on anything outside the database. So for example the surrogate key values can safely be reassigned or the key can be refactored or discarded without any consequences for users of the system.
Generally surrogate keys are kept hidden from users so that they don't acquire any such external dependencies. Being hidden from users was in fact part of the original definition of a surrogate key as proposed by E.F.Codd. If key values reside in the user's browser cache or favourites list then they aren't much use as "surrogates" any more. So that's one common reason why you will see one key used only inside the database and a different key for the same table made visible in the application.

I think it may depend on the type of application you are working with. I work with enterprise software that is only used by the company I work for and is not generally available to the outside world. In this case, it is often critical to let the user see the surrogate key for people-related records because the information in the person table has no uniqueness. There can be two John Smiths (we actually have over 1000 of them) who are genuinely different people. They may even have the same business address and be different people (sons are often named for fathers and work in the same medical practice, for instance). So users need to refer to the surrogate key on forms and in reporting to ensure they are using the record they thought they wanted. Otherwise, if they wanted to research further details about the John Smith that they saw in a report, how would they look it up in the application without having to go through all 1000 to find the right one? Creating a fake id as well as the real one would be time consuming (we import millions of records at a time) and for no real gain, since the data would not be visible outside our company application.
For a web app that is open to the general public, I can see where you might not want to show this information.

Looking for Denormalization Advice for Google App Engine

I am working on a system, which will run on GAE, which will have several related entities and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience....
The system will have users, with profile data and an image. Those users will be able to create "events" and add journal entries to them. For the purpose of the system, the "events" will likely have 1 or 2 journal entries in them, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, where popular ones may have hundreds or even thousands of comments. When a random visitor uses the system, they should be able to see the latest events (latest being defined by those with the latest journal entries in them), search by tag, and perform a very basic text search. Then upon selecting an event to view, it should be displayed with all journal entries, and all user comments, with user images alongside comments. A user should also have a kind of self-admin page, to view/modify/delete their events and to view/modify/delete comments they have made on other events. So, doing all this on a normal RDBMS would just be queries with some big joins across several tables. On GAE it would obviously need to work differently. Here are my initial thoughts on the design of the entities:
Event entity - id, name, timestamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from text from contained journal entries)
JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
UserComment entity - username, id of event commented on, title of event commented on
TagData entity - tag name, count of events with tag on them
So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!

Rather than store Event.id as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them.
You have lots of options for modeling the relationship between Event and JournalEntry: you could use a ReferenceProperty, you could parent JournalEntries to Events and retrieve them with ancestor queries, or you could store a list of JournalEntry key ids or names on Event and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best.
UserComment references an Event, while JournalEntry references a list of UserComments, which is a little confusing. Is there a relationship between UserComment and JournalEntry? or just between UserComment and Event?
Persisting so many counts is expensive. When I post a comment, you're going to write a new UserComment entity and also update my User entity and a JournalEntry entity and an Event entity. The number of UserComments you expect per Event makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache?
When you fetch an Event from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting each Event's search index words into a separate child EventIndex entity. Then you can query EventIndex on your search term, fetch just the EventIndex keys for EventIndexes that match your search, derive the corresponding Events' keys with key.parent(), and fetch the Events by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35.
Updating Event.viewCount will fail if you have a lot of views for any Event in rapid succession, so you should try out counter sharding.
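The idea behind counter sharding can be sketched in plain Python. This is an in-memory stand-in, not the App Engine datastore API: the `shards` dict plays the role of the shard entities, and NUM_SHARDS and the counter name are arbitrary choices for the example.

```python
import random

NUM_SHARDS = 20  # hypothetical; more shards allow more concurrent writers

# Stand-in for the datastore: (counter name, shard index) -> partial count.
# On App Engine each key would be a separate small entity.
shards = {}

def increment(counter_name, rng=random):
    """Add one to a randomly chosen shard so writers rarely collide."""
    index = rng.randrange(NUM_SHARDS)
    key = (counter_name, index)
    shards[key] = shards.get(key, 0) + 1

def get_count(counter_name):
    """The counter's value is the sum over all of its shards."""
    return sum(
        value for (name, _index), value in shards.items()
        if name == counter_name
    )

for _ in range(1000):
    increment("event-42-views")

total_views = get_count("event-42-views")
```

Writes spread across many small records instead of contending on one Event entity; reads pay for it by summing the shards, which is a good candidate for caching.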
Good luck, and tell us what you learn by trying stuff out.

Should I expose a user ID to public?

I have a form that reveals user IDs to the public. I was wondering whether this is dangerous. Personally I do not see anything bad about it. The ID is just used to reference a single database record.

If it were dangerous, Stack Overflow wouldn't be displaying user IDs in their URLs in order to make user profile lookups work: https://stackoverflow.com/users/104826/rfactor
Edit of seriousness of immense levels: if user IDs are themselves sensitive data; for example your primary keys for some reason happen to be social security numbers, that'll definitely be a security and privacy liability. If your user IDs are just auto-increment numbers though, you're clear.

Generally it's not a problem, but it can give away hints about how active your site is, such as how many users you have. Whether you consider this sensitive information, or maybe even good marketing, is completely up to you.
There's a story that this was one of the reasons the Germans lost WW2. They had sequential serial numbers from production written on each tank. By collecting serial numbers from tanks taken out of action, the British could estimate how many tanks the whole German army had and plan new strategies from that.
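The estimate in that story is simple enough to show. Assuming IDs are handed out sequentially starting at 1 and the ones you have seen are a random sample without repeats, the standard estimator for the total is m + m/k - 1, where m is the largest ID seen and k is how many IDs were seen. The sample IDs below are made up.

```python
def estimate_total(observed_ids):
    """Estimate how many sequential IDs exist from a sample of them.

    Assumes IDs start at 1, have no gaps, and were sampled at random
    without replacement.
    """
    k = len(observed_ids)   # sample size
    m = max(observed_ids)   # largest ID seen
    return m + m / k - 1

# Four user IDs spotted in public URLs (hypothetical values).
estimated_users = estimate_total([19, 40, 42, 60])  # 60 + 60/4 - 1 = 74.0
```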

I have found that exposing primary keys that identify physical entities can create headaches.
Imagine if two blood samples come into a laboratory and test results are generated for each sample. Many different kinds of test might be done and each record representing a test result will have the sample_id as a foreign key.
If you share the database ID with the customer and you discover that two samples were accidentally switched, you will have to update the foreign keys in all the detail records representing the tests. If you instead exposed some other unique name outside your system, you will just need to switch the two unique names on the sample records in the master table.
There are other advantages related to data migration and there are advantages when entities are represented in more than one database in which it is difficult to create records with identical database ID's.
In my experience it is always best to expose a unique identifier other than the primary key outside your system. It gives you more flexibility in resolving data mix-ups, dealing with data migration issues, and in otherwise future-proofing your system.
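A small sketch of the blood-sample scenario, using Python's sqlite3. The table names, the public_code column, and the sample values are all invented for the example; the point is that correcting the mix-up touches two rows in the master table and none of the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,      -- internal key, never shown outside
    public_code TEXT NOT NULL UNIQUE    -- identifier the customer sees
);
CREATE TABLE test_result (
    result_id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample (sample_id),
    test_name TEXT NOT NULL,
    value REAL NOT NULL
);
INSERT INTO sample VALUES (1, 'S-AAA'), (2, 'S-BBB');
INSERT INTO test_result (sample_id, test_name, value) VALUES
    (1, 'glucose', 5.1), (1, 'iron', 14.0), (2, 'glucose', 7.9);
""")

def swap_public_codes(id_a, id_b):
    """Swap the customer-facing codes of two samples in one transaction."""
    query = "SELECT public_code FROM sample WHERE sample_id = ?"
    update = "UPDATE sample SET public_code = ? WHERE sample_id = ?"
    with conn:  # commits on success, rolls back on error
        code_a = conn.execute(query, (id_a,)).fetchone()[0]
        code_b = conn.execute(query, (id_b,)).fetchone()[0]
        # Park one row on a temporary value so the UNIQUE constraint
        # is never violated mid-swap.
        conn.execute(update, ("__swap__", id_a))
        conn.execute(update, (code_a, id_b))
        conn.execute(update, (code_b, id_a))

def glucose_for(public_code):
    """Glucose result reported under a given customer-facing code."""
    return conn.execute(
        """SELECT r.value FROM test_result r
           JOIN sample s ON s.sample_id = r.sample_id
           WHERE s.public_code = ? AND r.test_name = 'glucose'""",
        (public_code,),
    ).fetchone()[0]

before = glucose_for("S-AAA")
swap_public_codes(1, 2)   # the lab discovers the two samples were switched
after = glucose_for("S-AAA")
```

The test_result rows are untouched by the swap; had sample_id itself been the customer-facing identifier, every detail row for both samples would have needed updating.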

As for me, showing an ID is as dangerous as showing a user name.

Exposing a user ID is not, in and of itself, bad. It depends on the level of privacy and security needed. If the user ID does not expose and cannot be tied to any other personal data that should otherwise be private, it may not be a problem.
But don't think that public user IDs can never be a problem.
Make sure you don't allow anyone to break in to any private data just by knowing user IDs. Facebook has had problems like that. Here's just one example. While revealing user IDs wasn't the whole story, it was part of the equation.

Will it hurt anything? Only you can decide that, and you should think that through.
But in general, it is poor form to display the User ID without having a business reason to do so. ("It saves you work" is probably not a good business reason.)

If it is a generated database id with no other meaning, it's not dangerous. Though I don't think revealing an id is elegant either. It's a technical detail and I can't understand why you would like to show it to users.
