Which is better: making "Date" a composite attribute or an atomic one? - database

In a scenario where I need to use the entire date (i.e. day, month, year) as a whole, and never need to extract the day, the month, or the year part of the date in my database application program, what is the best practice:
Making Date an atomic attribute
Making Date a composite attribute (composed of day, month, and year)
Edit: The question can be generalized as:
Is it a good practice to make composite attributes where possible, even when we need to deal with the attribute as a whole only?

Actually, the specific question and the general question are significantly different, because the specific question refers to dates.
For dates, the component elements aren't really part of the thing you're modelling - a day in history - they're part of the representation of the thing you're modelling - a day in the calendar that you (and most of the people in your country) use.
For dates I'd say it's best to store it in a single date type field.
For the generalized question I would generally store them separately. If you're absolutely sure that you'll only ever need to deal with it as a whole, then you could use a single field. If you think there's a possibility that you'll want to pull out a component for separate use (even just for validation), then store them separately.
With dates specifically, the vast majority of modern databases store and manipulate dates efficiently as a single Date value. Even in situations when you do want to access the individual components of the date I'd recommend you use a single Date field.
You'll almost inevitably need to do some sort of date arithmetic eventually, and most database systems and programming languages give some sort of functionality for manipulating dates. These will be easier to use with a single date variable.
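As a rough sketch of what that looks like on the application side (Python is used purely for illustration; the variable names and the example date are made up, not taken from the question):

```python
from datetime import date, timedelta

# Hypothetical value, as it might come back from a single DATE column.
invoice_date = date(2024, 1, 31)

# Arithmetic works directly on the whole value; month lengths and
# leap years are handled by the date type, not by hand-written code.
due_date = invoice_date + timedelta(days=30)
days_between = (due_date - invoice_date).days

# The components are still available if they are ever needed.
parts = (invoice_date.year, invoice_date.month, invoice_date.day)

print(due_date.isoformat(), days_between, parts)
```

With three separate integer fields, the same addition would need hand-written month-end and leap-year handling.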
With dates, the entire composite date identifies the primary real world thing you're identifying.
The day / month / year are attributes of that single thing, but only for a particular way of describing it - the western calendar.
However, the same day can be represented in many different ways: the Unix epoch, the Gregorian calendar, a lunar calendar. In some calendars we're in a completely different year. All of these representations can be different, yet refer to the same individual real world day.
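To make that concrete, here is a small sketch using Python's standard datetime module. The example day is arbitrary; the point is only that one stored day yields several different-looking representations.

```python
from datetime import date

day = date(2000, 1, 1)  # arbitrary example day

# Gregorian year / month / day components.
gregorian = (day.year, day.month, day.day)

# Whole days elapsed since the Unix epoch (1970-01-01).
days_since_epoch = (day - date(1970, 1, 1)).days

# ISO week calendar: (ISO year, week number, weekday number).
iso_week = tuple(day.isocalendar())

print(gregorian, days_since_epoch, iso_week)
```

Here the ISO week calendar places the day in a different year than the Gregorian calendar does, even though both describe the same day.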
So, from a modelling point of view, and from a database / programmatic efficiency point of view, for dates, store them in a single field as far as possible.
For the generalisation, it's a different question.
Based on experience, I'd store them as separate components. If you were really sure you'd never want to access component information, then yes, one field would be fine. For as long as you're right. But if there's even a possibility of breaking the information up, I personally would separate them from the start.
It's much easier to join fields together than to separate fields out of a composite string. That's true both from a programming / algorithmic viewpoint and from a compute resource point of view.
Some of the most painful problems I've had in programming have been trying to decompose a single field into component elements. They'd initially been stored as one element, and by the time the business changed enough to realise they needed the components... it had become a decent sized challenge.
Most composite data items aren't like dates. Where a date is a single item that is sometimes (ok, generally in the western world) represented by a Day-Month-Year composite, most composite data elements actually represent several concrete items, and only the combination of those items truly uniquely represents a particular thing.
For example a bank account number (in New Zealand, anyway) is a bit like this:
A bank number - 2 or 3 digits
A branch number - 4 to 6 digits
An account / customer number - 8 digits
An account type number - 2 or 3 digits.
Each of those elements represents a single real world thing, but together they identify my account.
You could store these as a single field, and it'd largely work. You might decide to use a hyphen to separate the elements, in case you ever needed to split them.
If you really never need to access a particular piece of that information then you'd be fine storing it as a single composite value.
But if 3 years down the track one bank decides to charge a higher rate, or needs different processing, or if you want to do a regional promotion and could key that on the branch number, now you have a different challenge, and you'll need to pull out that information. You chose hyphens as separators, so you'll have to parse each row into its component elements to find them. (These days disk is pretty cheap, so if you do this, you'll store the parsed components. In the old days it was expensive, so you had to decide whether to pay to store them or pay to re-calculate them each time.)
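A minimal sketch of the difference between the two directions, in Python. The type name, field names and the sample number are invented for illustration, and the parts are kept as strings on the assumption that leading zeros matter.

```python
from typing import NamedTuple


class AccountNumber(NamedTuple):
    # Hypothetical field names for the four parts described above.
    bank: str
    branch: str
    account: str
    account_type: str


def join_account(parts: AccountNumber) -> str:
    """Going from components to the printed form is a one-liner."""
    return "-".join(parts)


def split_account(text: str) -> AccountNumber:
    """Going back means parsing, and deciding what to do with bad rows."""
    pieces = text.split("-")
    if len(pieces) != 4 or not all(p.isdigit() for p in pieces):
        raise ValueError(f"unrecognised account number: {text!r}")
    return AccountNumber(*pieces)


example = AccountNumber("01", "0123", "00123456", "00")
printed = join_account(example)
print(printed, split_account(printed).branch)
```

Joining never fails; splitting has to cope with every malformed value that was ever typed into the single field.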
Personally, in the bank account case (and probably the majority of other examples that I can think of) I'd store them separately, and probably set up reference tables to allow validation to happen (e.g. you can't enter a bank that we don't know about).
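Something along these lines, sketched with Python's built-in sqlite3 module only because it runs anywhere. The table names, column names and sample values are assumptions made for the example, not a real banking schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Reference table of the banks we know about.
conn.execute("CREATE TABLE bank (bank_no TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE account (
        bank_no    TEXT NOT NULL REFERENCES bank (bank_no),
        branch_no  TEXT NOT NULL,
        account_no TEXT NOT NULL,
        type_no    TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO bank VALUES ('01', 'Example Bank')")


def add_account(bank_no, branch_no, account_no, type_no):
    """Insert an account; return False if a constraint (here the bank reference) rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO account VALUES (?, ?, ?, ?)",
                (bank_no, branch_no, account_no, type_no),
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = add_account("01", "0123", "00123456", "00")
rejected = add_account("99", "0123", "00123456", "00")  # bank 99 is not in the reference table

# Filtering on one component needs no string parsing.
rows = conn.execute("SELECT account_no FROM account WHERE bank_no = '01'").fetchall()
print(accepted, rejected, rows)
```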

Related

How to handle multiple, fixed values in a single database column (Database-design)

I have a table to keep flights and I want to keep which days of the week each flight operates.
There is no need for a date for this since I only need day names.
At first I thought of having a column in the flight table that keeps a single string with the day names inside, and using my application logic to unravel the information.
This seems OK since the only operation on the days will be to retrieve them.
The thing is, I don't find this "clean" enough, so I thought of making a separate table to keep all 7 day names and a many-to-many (auto-generated) table to keep the flight_id and day_id.
Still, there are only 7 fixed values in the days table, and I am not so sure about the second approach either.
What I would like is some other opinions on how to handle this.
A flight can operate on many different days of a week
Only day names are needed - so, 7 in total.
Sorry for bad English and if this is a trivial question for some. I am not too experienced in both English language and databases.
Some databases support arrays. PostgreSQL for example supports arrays.
You could store the days in an array of integers and use a function to translate integers to day names. You could also use an array of a custom enum type (PostgreSQL Example).
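The translation function could be as small as the sketch below. It is written in Python on the application side; the numbering (0 = Monday through 6 = Sunday) and the names are assumptions for the example, so use whatever convention your application already has.

```python
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def day_names(day_numbers):
    """Translate stored day numbers (assumed 0 = Monday ... 6 = Sunday) into names."""
    names = []
    for number in sorted(set(day_numbers)):  # de-duplicate and put in week order
        if not 0 <= number <= 6:
            raise ValueError(f"not a day of the week: {number}")
        names.append(DAY_NAMES[number])
    return names


operating_days = [4, 0, 2]  # hypothetical value read from the array column
print(day_names(operating_days))
```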

Efficiently retrieving entities that match any element of a set of ids

I'm writing software to provide feedback to people across many categories. For example, I might have 30 employees and 40 standards of evaluation (e.g. "arrives on time," "is polite," "appears to brush his teeth," etc). At arbitrary times, the supervisor can submit a piece of feedback like "employee 3 gets a 5/5 for standard 8 (he smells great)" or "employee 10 gets a 1/5 for standard 12 (he just called a customer an idiot)."
My idea is to store these small pieces of feedback individually, linked to the employee and standard by keeping userId and standardId fields.
The problem comes when I want to look at the feedback for all 30 employees and 40 standards. My current approach requires 1200 queries to retrieve all of that data. I'm looking for a better way. I'm using the Google App Engine datastore, which is a non-relational db.
Things I've been thinking about, and on which I welcome feedback:
I could store the feedback in a grid, with a row per user and column per standard. Then, a single query gets all of the data (better than 1200), but entering new data becomes more difficult (fetch the grid, update the correct bit, store the grid) and changes in the user set or standard set become much more complex (if I add a standard in the middle, this grid needs to be updated). Also, some queries become much harder - I can no longer easily search for the assessments entered on a certain date or by a certain supervisor.
I could store all of the feedback for a certain (set of users x set of standards) in an unorganized list, fetch it with a single query, and then sort it out in my own code. This requires me to loop through 1200 entries, but that would be faster than 1200 queries over all of the data in the whole system (there may be a great deal of irrelevant data for other sets of users and unrelated standards).
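The "sort it out in my own code" step of the second idea might look roughly like this. It is plain Python with no datastore calls; the tuple layout and the function name are made up for the sketch.

```python
from collections import defaultdict


def build_grid(feedback_items):
    """Group (user_id, standard_id, score) tuples so that
    grid[user_id][standard_id] is the list of scores for that pair."""
    grid = defaultdict(lambda: defaultdict(list))
    for user_id, standard_id, score in feedback_items:
        grid[user_id][standard_id].append(score)
    return grid


# Hypothetical result of the single fetch.
fetched = [(3, 8, 5), (10, 12, 1), (3, 8, 4)]
grid = build_grid(fetched)
print(grid[3][8], grid[10][12])
```

This is a single pass over the fetched entries, so the cost is proportional to the number of entries rather than to users times standards.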
So, the short version of my question is: how should I store this data for the best balance of quick retrieval of a large subset and quick insertion of individual pieces of feedback?
You might be able to do this using a RelationIndex. Depending on how exactly you want to allow users to view and query the data, it should work.
The idea is pretty straightforward: basically you will store a list of "standards" for each employee, and possibly a list of employees for each standard. Then you'll be able to ask questions such as "all employees who 'smell good'".
Because you have scores for each standard, you might want to do something like store the "score" and "standard number" as a pair in the list ("3:12") so that you can find everyone who has a score of 3 on standard 12.
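A minimal sketch of that pairing idea in plain Python, ignoring the datastore entirely. The function names and sample data are invented; in the datastore the list would live in a string list property and the membership test would be a query filter rather than a loop.

```python
def index_entries(scores_by_standard):
    """Build "score:standard" strings from a {standard_number: score} mapping."""
    return [f"{score}:{standard}" for standard, score in sorted(scores_by_standard.items())]


def employees_matching(index_by_employee, score, standard):
    """Return the ids of employees whose index contains the given score/standard pair."""
    key = f"{score}:{standard}"
    return sorted(emp for emp, entries in index_by_employee.items() if key in entries)


# Hypothetical data: employee id -> index list.
index_by_employee = {
    3: index_entries({8: 5, 12: 3}),
    10: index_entries({12: 1}),
    11: index_entries({12: 3}),
}
print(index_by_employee[3], employees_matching(index_by_employee, 3, 12))
```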
edit: Updated based on comment.
It sounds like you need to deal with a few different issues. First, you need to deal with editing and maintaining the data. Second, you need to deal with querying the data. Third, you are going to need to handle displaying the data.
For querying the data efficiently you will probably need some approach similar to what I initially suggested. What is more common, editing or viewing the data? That will impact how you set up your models.
If you are only dealing with 30 or 40 employees and 30 or 40 standards, maybe you could use something like the following:
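from google.appengine.ext import db

class Evaluations(db.Model):
    period = db.StringProperty()
    standards = db.TextProperty()  # serialized list of standards evaluated
    scores = db.TextProperty()     # serialized employee-standard-score grid

class EvaluationsIndex(db.Model):
    index = db.StringListProperty()  # entries to query on, e.g. "3:12"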
Use the standards property on Evaluations to store a list of standards evaluated. Then store your employee-standard-score grid in the scores property. Obviously you'll need to serialize both the standards list and the evaluation grid, perhaps using something like JSON. Use the EvaluationsIndex model as I mentioned above.
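For the serialization part, here is a sketch using Python's standard json module. The grid layout (employee id to standard id to score) and the helper name are assumptions; the strings it produces are what would go into the two text properties.

```python
import json


def set_score(scores_text, employee_id, standard_id, score):
    """Return new serialized grid text with one cell added or updated."""
    grid = json.loads(scores_text) if scores_text else {}
    # JSON object keys are always strings, so the ids are converted explicitly.
    grid.setdefault(str(employee_id), {})[str(standard_id)] = score
    return json.dumps(grid, sort_keys=True)


standards_text = json.dumps([8, 12])  # value for the standards property
scores_text = set_score("", 3, 8, 5)  # value for the scores property
scores_text = set_score(scores_text, 10, 12, 1)

grid = json.loads(scores_text)
print(standards_text, grid)
```

Each edit is a fetch, deserialize, update, serialize, store cycle, which is fine at this size but is the trade-off of keeping the whole grid in one property.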
With this (or something really similar) you will have pretty easy edits, very easy display, and support for queries.
You could add an additional model to track which supervisor entered the evaluation and her notes.

Atomicity of field for part numbers

In our internal inventory application, we store three values (in separate fields) that become the printed "part number" in this format: PPP-NNNNN-VVVV (P = Prefix, N = Number, V = version).
So for example, if you have a part 010-00001-01 you know it's version 1 of a part of type "010" (which let's say is a printed circuit board).
So, in the process of creating parts engineering wants to group parts together by keeping the "number" component (the middle 5 digits) the same across multiple prefixes like so:
001-00040-0001 - Overall assembly
010-00040-0001 - PCB
015-00040-0001 - Schematics
This seems problematic and frustrating as it sometimes adds extra meaning to the "number" field (but not consistently since not all parts with the same "number" component are necessarily linked).
Am I being a purist or is this fine? 1NF is awfully vague with regards to atomicity. I think I'm mostly frustrated because of the extra logic to ensure that the next "number" part of the overall part number is valid and available for all prefixes.
There have been a number of enterprises that have foundered, or nearly foundered, on the "part number syndrome". You might be able to find some case studies. DEC part numbers were somewhat mixed up.
The customer is not always right, but the customer is always the customer.
In this case, it sounds to me like engineering is trying to use a single number to model a relationship. I mean the relationship between Overall assembly, PCB, and Schematics. It's better to model relationships as relations. It allows you more flexibility down the road. You may have a hard time selling engineering on this point.
In my experience, regardless of database normative rules, when the client/customer/user wants something done a certain way, there is most likely a reason for it, and that reason will save them money (in some fashion). Sometimes it will save money by reducing steps, by reducing training costs, or simply because That's The Way It's Always Been. Whatever the reason, eventually you'll end up doing it because they're paying to have it done (unless it violates accounting rules).
In this instance, it sounds like an extra sorting criterion on some queries for reports, and a new 'allocated number' table with an auto-incrementing key. That doesn't sound too bad to me. Ask me sometime about the database report a client VP commissioned strictly to cast data in such a fashion as to make a different VP look bad in meetings (not that he told me that up front).

Strategy in storing ad-hoc numbers/constants?

I have a need to store a number of ad-hoc figures and constants for calculation.
These numbers change periodically but they are different type of values. One might be a balance, a money amount, another might be an interest rate, and yet another might be a ratio of some kind.
These numbers are then used in a calculation that involve other more structured figures.
I'm not certain what the best way to store these in a relational DB is - that's the choice of storage for the app.
One way, which I've done before, is to create a very generic table that stores the values as text. I might store the data type along with it, but the consumer knows what type it is, so in some situations I didn't even need to store the data type. This kind of works, but I am not very fond of the solution.
Should I break down each of the numbers into specific categories and create tables that way? For example, create Rates table, and Balances table, etc.?
Yes, you should definitely structure your database accordingly. Having a generic table holding text values is not a great solution, and it also adds overhead when using those values in programs that may pull that data for some calculations.
Keeping each of the tables and values separated allows you to do things like adding dates and statuses to your values (perhaps some are active while others aren't?) and also allows you to keep an accurate history (what if I want to see a particular rate from last year?). It also makes things easier for those who come behind you to sift through your data.
I suggest reading this article on database normalization.

Storing Composite Product Numbers

I am designing a laboratory database. Several products, samples, etc are identified by a composite number with multiple parts which indicate different values such as: origin, date, type, id today, etc. Examples of composite numbers might include a driver's license number (X44-555-3434), lot number (XBR-A26-500-2).
How should composite numbers be stored in a database? Should they be stored as a string or should each component of the composite number be stored (or derived) separately?
NOTE: Use Oracle if the question cannot be answered generally.
In my experience, if there are elements of a string that have meaning, it's best to put them in their own field. The gymnastics we go through trying to tease out the meaning are complex and error-prone; also it is easier to do data validation when we deal with each field explicitly. It's easy to construct the composite string, easy to search on the constructed string (albeit sometimes difficult to index). Fine-grained storage has always worked best for me.
It depends on what you eventually want to query. If you want to look for parts of the string separately, save it separately. If you want to look for the whole code (with or without the dashes, that can be dealt with), store it as a single string.
Also, if you decide to store them in separate columns, it would be a good idea to make a unique key out of them (unique key containing all the columns).
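A sketch of that unique key, using Python's sqlite3 module as a stand-in (the question mentions Oracle, where the constraint syntax is similar). The column names seg1 to seg4 are placeholders, since the question does not say what each part of the lot number means.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lot (
        seg1 TEXT NOT NULL,
        seg2 TEXT NOT NULL,
        seg3 TEXT NOT NULL,
        seg4 TEXT NOT NULL,
        UNIQUE (seg1, seg2, seg3, seg4)  -- the components together identify the lot
    )
    """
)
conn.execute("INSERT INTO lot VALUES ('XBR', 'A26', '500', '2')")

# A second row with the same combination violates the unique key.
try:
    conn.execute("INSERT INTO lot VALUES ('XBR', 'A26', '500', '2')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The printed form is rebuilt on demand and can still be searched on.
found = conn.execute(
    "SELECT seg1 || '-' || seg2 || '-' || seg3 || '-' || seg4 FROM lot "
    "WHERE seg1 || '-' || seg2 || '-' || seg3 || '-' || seg4 = ?",
    ("XBR-A26-500-2",),
).fetchall()
print(duplicate_rejected, found)
```

Searching on the concatenated expression generally cannot use a plain index on the individual columns, which is the indexing caveat raised in the first answer.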
I would store each coded piece of the composite number separately, possibly as the actual value the code is meant to represent (i.e. instead of storing "XBR" store "January 12, 2007" as a timestamp). But that would depend on whether you expect to look up items by coded composite numbers more often, or by their actual semantics.
Then I would keep the mapping between the codes and the actual values somewhere in the database as well. So I'd have a few small tables that store two columns each, one representing codes and one representing, for example, timestamps.
Is the derivation of these numbers likely to change, and how would you expect that to impact the application?
For example, years ago car registration plates in Britain started with three letters (which I think indicated a region), followed by three digits, followed by another letter which indicated the year of registration.
Eventually they ran out of letters for the suffix, and switched the ordering round so rather than having "ABC 123 A", you may have "A 123 ABC", and it would have been possible to have two cars registered at the same time, one with "ABC 123 A" and the other "A 123 ABC".
Oh, and then they 'sped' up the use of the single letter year indicator (as lots of people were waiting until the new letter came into effect before buying a car), so it no longer indicated a year.
If you were just interested in registration plates, then you would have been best off storing the registration plate as one value. However the application responsible for issuing the numbers probably needed them broken down into their components.
