Storing Composite Product Numbers - database

I am designing a laboratory database. Several products, samples, etc. are identified by a composite number whose parts indicate different values such as origin, date, type, id today, etc. Examples of composite numbers might include a driver's license number (X44-555-3434) or a lot number (XBR-A26-500-2).
How should composite numbers be stored in a database? Should they be stored as a string or should each component of the composite number be stored (or derived) separately?
NOTE: Use Oracle if the question cannot be answered generally.

In my experience, if there are elements of a string that have meaning, it's best to put them in their own field. The gymnastics we go through trying to tease out the meaning are complex and error-prone; also it is easier to do data validation when we deal with each field explicitly. It's easy to construct the composite string, easy to search on the constructed string (albeit sometimes difficult to index). Fine-grained storage has always worked best for me.

It depends on what you eventually want to query. If you want to look for parts of the string separately, save it separately. If you want to look for the whole code (with or without the dashes, that can be dealt with), store it as a single string.

Also, if you decide to store them in separate columns, it would be a good idea to make a unique key out of them (unique key containing all the columns).
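A minimal sketch of the separate-columns approach with a unique key over all the parts, using Python's built-in sqlite3 module rather than Oracle. The column names and the meaning assigned to each part of XBR-A26-500-2 are assumptions made up for illustration; the question does not say what the parts stand for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lot (
        origin   TEXT    NOT NULL,  -- hypothetical meaning of 'XBR'
        lot_type TEXT    NOT NULL,  -- hypothetical meaning of 'A26'
        batch    INTEGER NOT NULL,  -- hypothetical meaning of '500'
        seq      INTEGER NOT NULL,  -- hypothetical meaning of '2'
        UNIQUE (origin, lot_type, batch, seq)
    )
    """
)
conn.execute("INSERT INTO lot VALUES ('XBR', 'A26', 500, 2)")


def lot_number(origin, lot_type, batch, seq):
    """Build the printed composite number from its stored parts."""
    return f"{origin}-{lot_type}-{batch}-{seq}"


rows = conn.execute("SELECT origin, lot_type, batch, seq FROM lot").fetchall()
composite = [lot_number(*row) for row in rows]

# The unique key rejects a second row with the same combination of parts.
try:
    conn.execute("INSERT INTO lot VALUES ('XBR', 'A26', 500, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Searching on one part is then an ordinary WHERE clause on one column, and the full string is only assembled for display.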

I would store each coded piece of the composite number separately, possibly as the actual value the code is meant to represent (i.e. instead of storing "XBR" store "January 12, 2007" as a timestamp). But that would depend on whether you expect to look up items by coded composite numbers more often, or by their actual semantics.
Then I would keep the mapping between the codes and the actual values somewhere in the database as well. So I'd have a few small tables that store two columns each, one representing codes and one representing, for example, timestamps.
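The small two-column mapping tables could look roughly like this. It is only a sketch in Python with sqlite3; the table and column names are invented, and the pairing of "XBR" with January 12, 2007 is taken from the example in the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Mapping table: one column for the code, one for what it means.
    CREATE TABLE date_code (
        code    TEXT PRIMARY KEY,
        made_on TEXT NOT NULL UNIQUE   -- ISO 8601 date
    );
    -- The item stores the real value, not the code.
    CREATE TABLE lot (
        lot_id  INTEGER PRIMARY KEY,
        made_on TEXT NOT NULL
    );
    INSERT INTO date_code VALUES ('XBR', '2007-01-12');
    INSERT INTO lot (made_on) VALUES ('2007-01-12');
    """
)

# Recover the printed code by joining back through the mapping table.
row = conn.execute(
    """
    SELECT l.lot_id, l.made_on, d.code
    FROM lot AS l
    JOIN date_code AS d ON d.made_on = l.made_on
    """
).fetchone()
```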

Is the derivation of these numbers likely to change, and how would you expect that to impact the application?
For example, years ago car registration plates in Britain started with three letters (which I think indicated a region), followed by three digits, followed by another letter which indicated the year of registration.
Eventually they ran out of letters for the suffix, and switched the ordering round so rather than having "ABC 123 A", you may have "A 123 ABC", and it would have been possible to have two cars registered at the same time, one with "ABC 123 A" and the other "A 123 ABC".
Oh, and then they 'sped' up the use of the single-letter year indicator (as lots of people were waiting until the new letter came into effect before buying a car), so it no longer indicated a year.
If you were just interested in registration plates, then you would have been best off storing the registration plate as one value. However the application responsible for issuing the numbers probably needed them broken down into their components.
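One way to hedge against format changes like this is to store the plate as a single value and parse it only where the components are needed. The sketch below is in Python, and the two patterns are simplified from the description above; they are not the real registration rules.

```python
import re

# Simplified patterns based only on the examples above (an assumption).
SUFFIX_STYLE = re.compile(r"^(?P<area>[A-Z]{3}) (?P<serial>\d{1,3}) (?P<year>[A-Z])$")
PREFIX_STYLE = re.compile(r"^(?P<year>[A-Z]) (?P<serial>\d{1,3}) (?P<area>[A-Z]{3})$")


def parse_plate(plate):
    """Return the components of a plate, or None if no known format matches."""
    for style, pattern in (("suffix", SUFFIX_STYLE), ("prefix", PREFIX_STYLE)):
        match = pattern.match(plate)
        if match:
            return {"style": style, **match.groupdict()}
    return None


old_style = parse_plate("ABC 123 A")
new_style = parse_plate("A 123 ABC")
```

When a third format appears, only the parser needs a new pattern; the stored values stay as they are.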

Related

String or Number Codes for Field Values - What's the Best Practice?

In an application (in my case, a web application with database, server, and frontend layers), if you have a field that has a limited number of possible values, what's the best way to represent that on the database?
For example, say we have an employee status field that can have the following values:
Active
Terminated
Leave of Absence
This field will be transferred between tiers, and logic done on the values for different purposes (say to highlight with different colors on the web page).
So what's the best way to store these values so that a) bugs are less likely but b) is still easy for developers to use?
Should the value be stored as an enumeration (1, 2, 3) corresponding to the different status, or as plain strings, or as some shortened value with no whitespace, like (ACT, TER, LOA), or is this purely a matter of preference? I was leaning towards numbers as you won't run into spelling mistakes, but that has the tradeoff of obfuscation.
Tag suggestions for this topic appreciated.
I think both have pros and cons.
Using Numbers:
1. Smaller size
2. Performance
3. Fewer errors
4. Easy to code
On the other hand, when using IDs, one major issue is that every application that needs to interact with the DB needs to know these IDs and their meanings; the primary example is any reporting tool. Of course, we can negate this impact by creating a definition table for each such object type, i.e. a table with only two columns, ID and DESCR; then a simple join will be sufficient.
Using Strings:
1. Easy to remember
2. Easy for all other Systems directly talking to DB
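The ID-plus-definition-table approach might be sketched like this in Python with sqlite3. The ID and DESCR columns come from the answer above; the table names and the employee rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Definition table: just ID and DESCR.
    CREATE TABLE employee_status (
        id    INTEGER PRIMARY KEY,
        descr TEXT NOT NULL UNIQUE
    );
    CREATE TABLE employee (
        name      TEXT    NOT NULL,
        status_id INTEGER NOT NULL REFERENCES employee_status (id)
    );
    INSERT INTO employee_status VALUES
        (1, 'Active'), (2, 'Terminated'), (3, 'Leave of Absence');
    INSERT INTO employee VALUES ('Ada', 1), ('Grace', 3);
    """
)

# A reporting tool gets readable values with one join.
report = conn.execute(
    """
    SELECT e.name, s.descr
    FROM employee AS e
    JOIN employee_status AS s ON s.id = e.status_id
    ORDER BY e.name
    """
).fetchall()
```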

What is better: make "Date" composite attribute or atomic?

In a scenario where I need to use the entire date (i.e. day, month, year) as a whole, and never need to extract the day, month, or year part of the date in my database application program, what is the best practice:
Making Date an atomic attribute
Making Date a composite attribute (composed of day, month, and year)
Edit:- The question can be generalized as:
Is it a good practice to make composite attributes where possible, even when we need to deal with the attribute as a whole only?
Actually, the specific question and the general question are significantly different, because the specific question refers to dates.
For dates, the component elements aren't really part of the thing you're modelling - a day in history - they're part of the representation of the thing you're modelling - a day in the calendar that you (and most of the people in your country) use.
For dates I'd say it's best to store it in a single date type field.
For the generalized question I would generally store them separately. If you're absolutely sure that you'll only ever need to deal with it as a whole, then you could use a single field. If you think there's a possibility that you'll want to pull out a component for separate use (even just for validation), then store them separately.
With dates specifically, the vast majority of modern databases store and manipulate dates efficiently as a single Date value. Even in situations when you do want to access the individual components of the date I'd recommend you use a single Date field.
You'll almost inevitably need to do some sort of date arithmetic eventually, and most database systems and programming languages give some sort of functionality for manipulating dates. These will be easier to use with a single date variable.
With dates, the entire composite date identifies the primary real world thing you're identifying.
The day / month / year are attributes of that single thing, but only for a particular way of describing it - the western calendar.
However, the same day can be represented in many different ways - the Unix epoch, the Gregorian calendar, a lunar calendar; in some calendars we're in a completely different year. All of these representations can be different, yet refer to the same individual real world day.
So, from a modelling point of view, and from a database / programmatic efficiency point of view, for dates, store them in a single field as far as possible.
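To illustrate with Python's standard datetime module (the specific dates are arbitrary examples): a single date value supports arithmetic directly, and the components or other representations can still be derived from it whenever they are needed.

```python
from datetime import date, timedelta

sampled = date(2007, 1, 12)

# Date arithmetic works on the whole value.
due = sampled + timedelta(days=30)
elapsed_days = (date(2007, 3, 1) - sampled).days

# Components and alternative representations are derived, not stored.
parts = (sampled.year, sampled.month, sampled.day)
iso_year_week_day = tuple(sampled.isocalendar())
ordinal = sampled.toordinal()  # day count in the proleptic Gregorian calendar
```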
For the generalisation, it's a different question.
Based on experience, I'd store them as separate components. If you were really really sure you'd never ever want to access component information, then yes, one field would be fine. For as long as you're right. But if there's even an ability to break the information up, I personally would separate them from the start.
It's much easier to join fields together than to separate fields from a component string. That's both from a programming / algorithmic viewpoint and from a compute resource point of view.
Some of the most painful problems I've had in programming have been trying to decompose a single field into component elements. They'd initially been stored as one element, and by the time the business changed enough to realise they needed the components... it had become a decent sized challenge.
Most composite data items aren't like dates. Where a date is a single item, that is sometimes (ok, generally in the western world) represented by a Day-Month-Year composite, most composite data elements actually represent several concrete items, and only the combination of those items truly uniquely represent a particular thing.
For example a bank account number (in New Zealand, anyway) is a bit like this:
A bank number - 2 or 3 digits
A branch number - 4 to 6 digits
An account / customer number - 8 digits
An account type number - 2 or 3 digits.
Each of those elements represents a single real world thing, but together they identify my account.
You could store these as a single field, and it'd largely work. You might decide to use a hyphen to separate the elements, in case you ever needed to.
If you really never need to access a particular piece of that information then you'd be good with storing it as a composite.
But if 3 years down the track one bank decides to charge a higher rate, or needs different processing; or if you want to do a regional promotion and could key that on the branch number, now you have a different challenge, and you'll need to pull out that information. We chose hyphens as separators, so you'll have to parse out each row into the component elements to find them. (These days disk is pretty cheap, so if you do this, you'll store them. In the old days it was expensive so you had to decide whether to pay to store it, or pay to re-calculate it each time).
Personally, in the bank account case (and probably the majority of other examples that I can think of) I'd store them separately, and probably set up reference tables to allow validation to happen (e.g. you can't enter a bank that we don't know about).
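A rough sketch of that layout, in Python with sqlite3. The table names, the sample bank and the account digits are all invented. The components are stored as text so leading zeros survive, and the bank reference table is what rejects an unknown bank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript(
    """
    CREATE TABLE bank (
        bank_no TEXT PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE account (
        bank_no    TEXT NOT NULL REFERENCES bank (bank_no),
        branch_no  TEXT NOT NULL,
        account_no TEXT NOT NULL,
        type_no    TEXT NOT NULL,
        PRIMARY KEY (bank_no, branch_no, account_no, type_no)
    );
    INSERT INTO bank VALUES ('01', 'Example Bank');
    INSERT INTO account VALUES ('01', '0123', '00123456', '00');
    """
)

# Validation: a bank we don't know about is rejected by the reference table.
try:
    conn.execute("INSERT INTO account VALUES ('99', '0001', '00000001', '00')")
    unknown_bank_rejected = False
except sqlite3.IntegrityError:
    unknown_bank_rejected = True

# Joining the parts for display is trivial...
display = conn.execute(
    "SELECT bank_no || '-' || branch_no || '-' || account_no || '-' || type_no "
    "FROM account"
).fetchone()[0]

# ...and so is selecting on a single component, with no string parsing.
accounts_at_bank = conn.execute(
    "SELECT COUNT(*) FROM account WHERE bank_no = '01'"
).fetchone()[0]
```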

Naming conventions for non-normalized fields

Is it a common practice to use special naming conventions when you're denormalizing for performance?
For example, let's say you have a customer table with a date_of_birth column. You might then add an age_range column because sometimes it's too expensive to calculate that customer's age range on the fly. However, one could see this getting messy because it's not abundantly clear which values are authoritative and which ones are derived. So maybe you'd want to name that column denormalized_age_range or something.
Is it common to use a special naming convention for these columns? If so, are there established naming conventions for such a thing?
Edit: Here's another, more realistic example of when denormalization would give you a performance gain. This is from a real-life case. Let's say you're writing an app that keeps track of college courses at all the colleges in the US. You need to be able to show, for each degree, how many credits you graduate with if you choose that degree. A degree's credit count is actually ridiculously complicated to calculate and it takes a long time (more than one second per degree). If you have a report comparing 100 different degrees, it wouldn't be practical to calculate the credit count on the fly. What I did when I came across this problem was I added a credit_count column to our degree table and calculated each degree's credit count up front. This solved the performance problem.
I've seen column names use the word "derived" when they represent that kind of value. I haven't seen a generic style guide for other kinds of denormalization.
I should add that in every case I've seen, the derived value is always considered secondary to the data from which it is derived.
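As an illustration of the "derived" naming idea applied to the age_range example from the question, here is a sketch in Python with sqlite3. The derived_ prefix, the ten-year buckets and the refresh function are assumptions for the example, not an established convention.

```python
import sqlite3
from datetime import date


def age_range(born, today):
    """Return a ten-year bucket such as '30-39' for the age on `today`."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    age = today.year - born.year - (0 if had_birthday else 1)
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        name              TEXT NOT NULL,
        date_of_birth     TEXT NOT NULL,  -- authoritative
        derived_age_range TEXT            -- recomputed from date_of_birth
    )
    """
)
conn.execute(
    "INSERT INTO customer (name, date_of_birth) VALUES ('Ada', '1985-06-15')"
)


def refresh_derived(conn, today):
    """Recompute every derived_age_range from the authoritative column."""
    rows = conn.execute("SELECT rowid, date_of_birth FROM customer").fetchall()
    for rowid, dob in rows:
        conn.execute(
            "UPDATE customer SET derived_age_range = ? WHERE rowid = ?",
            (age_range(date.fromisoformat(dob), today), rowid),
        )


refresh_derived(conn, date(2024, 1, 1))
stored = conn.execute("SELECT derived_age_range FROM customer").fetchone()[0]
```

The prefix makes it obvious which column can be thrown away and rebuilt, which matches the point that the derived value is secondary to its source.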
In some programming languages, e.g. Python (and some Java code bases), names with the _ prefix are used for private methods or variables. Private means it should not be modified/invoked by anything outside the class.
I wonder if this convention can be borrowed in naming derived database columns.
In Postgres, column names can start with _, e.g. _average_product_price.
It can convey the meaning that you can read this column, but don't write it because it's derived.
I'm in the same situation right now, designing a database schema that can benefit from denormalisation of central values. For example, table partitioning requires the partition key to exist in the table. So even if the data can be retrieved by following some levels of foreign keys, I need the data right there in most tables.
Maybe the suffix "copy" could be used for this. Because after all, the data is just a copy of some other location where the primary data is stored. Since it's a word, it can work with all naming conventions, like .NET PascalCase which can be mapped to SQL snake_case, e.g. CompanyIdCopy and company_id_copy. And it's a short word so you don't have to write too much. And it's not an abbreviation so you don't have to spell it or ever wonder what it means. ;-)
I could also think of the suffix "cache" or "cached" but a cache is usually filled on demand and invalidated some time later, which is usually not the case with denormalised columns. That data should exist at all times and never be outdated or missing.
The word "derived" is just a bit longer than "copy". I know that one special DBMS, an expensive one, has a column name limit of 30 characters, so that could be an issue.
If all of the values required to derive the calculation are in the table already, then it is extremely unlikely that you will gain any meaningful (or even measurable) performance benefit by persisting these calculated values.
I realize this doesn't answer the question directly, but it would seem that the premise is faulty: if such conditions existed for the question to apply, then you don't need to denormalize it to begin with.

Atomicity of field for part numbers

In our internal inventory application, we store three values (in separate fields) that become the printed "part number" in this format: PPP-NNNNN-VVVV (P = Prefix, N = Number, V = version).
So for example, if you have a part 010-00001-01 you know it's version 1 of a part of type "010" (which let's say is a printed circuit board).
So, in the process of creating parts engineering wants to group parts together by keeping the "number" component (the middle 5 digits) the same across multiple prefixes like so:
001-00040-0001 - Overall assembly
010-00040-0001 - PCB
015-00040-0001 - Schematics
This seems problematic and frustrating as it sometimes adds extra meaning to the "number" field (but not consistently since not all parts with the same "number" component are necessarily linked).
Am I being a purist or is this fine? 1NF is awfully vague with regards to atomicity. I think I'm mostly frustrated because of the extra logic to ensure that the next "number" part of the overall part number is valid and available for all prefixes.
There have been a number of enterprises that have foundered, or nearly foundered, on the "part number syndrome". You might be able to find some case studies. DEC part numbers were somewhat mixed up.
The customer is not always right, but the customer is always the customer.
In this case, it sounds to me like engineering is trying to use a single number to model a relationship. I mean the relationship between Overall assembly, PCB, and Schematics. It's better to model relationships as relations. It allows you more flexibility down the road. You may have a hard time selling engineering on this point.
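Modelling the grouping as a relation could look something like the sketch below (Python with sqlite3). The part_group tables, the surrogate part_id and the sample numbers are hypothetical. The point is only that group membership lives in its own table, so the middle "number" no longer has to match across prefixes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE part (
        part_id     INTEGER PRIMARY KEY,
        prefix      TEXT NOT NULL,
        number      TEXT NOT NULL,
        version     TEXT NOT NULL,
        description TEXT NOT NULL,
        UNIQUE (prefix, number, version)
    );
    CREATE TABLE part_group (
        group_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    -- The relationship itself, stored as rows instead of a shared number.
    CREATE TABLE part_group_member (
        group_id INTEGER NOT NULL REFERENCES part_group (group_id),
        part_id  INTEGER NOT NULL REFERENCES part (part_id),
        PRIMARY KEY (group_id, part_id)
    );
    INSERT INTO part VALUES (1, '001', '00040', '0001', 'Overall assembly');
    INSERT INTO part VALUES (2, '010', '00041', '0001', 'PCB');
    INSERT INTO part VALUES (3, '015', '00057', '0001', 'Schematics');
    INSERT INTO part_group VALUES (1, 'Example product');
    INSERT INTO part_group_member VALUES (1, 1), (1, 2), (1, 3);
    """
)

members = conn.execute(
    """
    SELECT p.prefix || '-' || p.number || '-' || p.version, p.description
    FROM part_group_member AS m
    JOIN part AS p ON p.part_id = m.part_id
    WHERE m.group_id = 1
    ORDER BY p.prefix
    """
).fetchall()
```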
In my experience, regardless of database normative rules, when the client/customer/user wants something done a certain way, there is most likely a reason for it, and that reason will save them money (in some fashion). Sometimes it will save money by reducing steps, by reducing training costs, or simply because That's The Way It's Always Been. Whatever the reason, eventually you'll end up doing it because they're paying to have it done (unless it violates accounting rules).
In this instance, it sounds like an extra sorting criterion on some queries for reports, and a new 'allocated number' table with an auto-incrementing key. That doesn't sound too bad to me. Ask me sometime about the database report a client VP commissioned strictly to cast data in such a fashion as to make a different VP look bad in meetings (not that he told me that up front).
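The 'allocated number' table might be as small as this sketch (Python with sqlite3). The table name, the note column and the five-digit formatting are assumptions. Each allocation reserves the number for every prefix at once, which removes the "is this number free for all prefixes" check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE allocated_number (
        number INTEGER PRIMARY KEY AUTOINCREMENT,
        note   TEXT
    )
    """
)


def allocate(conn, note):
    """Reserve the next number and return it in the printed NNNNN form."""
    cur = conn.execute("INSERT INTO allocated_number (note) VALUES (?)", (note,))
    return f"{cur.lastrowid:05d}"


first = allocate(conn, "first product family")
second = allocate(conn, "second product family")
```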

IDs for Information on More Than One DB/Server

I'm working on a project that I want to be as flexible and scalable as possible from the beginning. A problem I'm concerned about is one best described by Joshua Schachter in Founders at Work, who noted it as one detail he wishes he had planned for ahead of time.
Scaling past one machine, one database, is very challenging, even with replication. The tools that are there are not quite right.
For example, when you add things to a table and it numbers them, that means you can't have a second machine also adding to them because the numbers will collide. So what do you do? You have to come up with some completely different way to do it.
Do you have a central server that hands out number sets, or do you come up with something that's not numbers? Do you use random numbers and hope they never collide? Whatever it is, auto-assigned IDs just don't fly.
Has anyone here faced this problem? What are ways to move beyond auto-incremented IDs, or is there a way to have them scale with multiple servers?
Use a GUID/UUID (globally/universally unique identifier). It is designed to be unique across multiple machines without any coordination between them.
With GUIDs, your chances of collision are astronomically low.
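For example, with Python's standard uuid module, two "servers" (simulated here as two independent loops) can each generate random version 4 UUIDs without talking to each other:

```python
import uuid

# Each list stands in for IDs generated on a separate machine.
server_a_ids = [uuid.uuid4() for _ in range(1000)]
server_b_ids = [uuid.uuid4() for _ in range(1000)]

# No coordination was needed, and in practice no ID appears twice.
all_ids = server_a_ids + server_b_ids
distinct_count = len(set(all_ids))
```

Uniqueness here is probabilistic rather than enforced, so a unique constraint on the column is still worth keeping.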
It's also possible to have (what we called) SmartGUIDs (usually called COMB GUIDs - see this analysis, particularly page 7) where you encode a timestamp within the GUID, so you get record creation date information "for free" and can do without a separate timestamp column for record creation datetime - which gets back some of what you lost on moving from a 32-bit integer to a 128-bit GUID. These can also be guaranteed to be monotonic, unlike regular GUIDs, which can be useful for clustered indexes and for sorting.
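A rough sketch of the timestamp-in-GUID idea in Python. This is not the actual COMB layout, which depends on how a particular database sorts GUIDs; it simply assumes a 48-bit millisecond timestamp in the leading bytes followed by 80 random bits. The function names are made up.

```python
import os
import time
import uuid


def timestamped_guid(now_ms=None):
    """Build a 128-bit ID: 48-bit millisecond timestamp + 80 random bits."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    prefix = (now_ms & ((1 << 48) - 1)).to_bytes(6, "big")
    return uuid.UUID(bytes=prefix + os.urandom(10))


def created_ms(guid):
    """Read the creation timestamp back out of the leading 6 bytes."""
    return int.from_bytes(guid.bytes[:6], "big")


earlier = timestamped_guid(1_000)
later = timestamped_guid(2_000)
```

IDs built this way sort by creation time at millisecond granularity. Two IDs created within the same millisecond are ordered by their random part, so strict monotonicity would need an extra counter that this sketch leaves out.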
You can also use composite keys with some kind of server/db ID with a regular auto-increment identity or auto-number.
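The composite-key variant, sketched in Python with sqlite3. Two in-memory databases stand in for two servers, and the table and column names are invented. Each server keeps its ordinary auto-increment, and the server ID makes the pair unique once the rows are brought together.

```python
import sqlite3


def new_server_db():
    """Create a stand-in for one server's database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item (local_id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)"
    )
    return conn


servers = {1: new_server_db(), 2: new_server_db()}
servers[1].execute("INSERT INTO item (body) VALUES ('from server 1')")
servers[2].execute("INSERT INTO item (body) VALUES ('from server 2')")

# Both rows have local_id 1, but (server_id, local_id) never collides.
merged = sqlite3.connect(":memory:")
merged.execute(
    """
    CREATE TABLE item (
        server_id INTEGER NOT NULL,
        local_id  INTEGER NOT NULL,
        body      TEXT,
        PRIMARY KEY (server_id, local_id)
    )
    """
)
for server_id, conn in servers.items():
    for local_id, body in conn.execute("SELECT local_id, body FROM item"):
        merged.execute(
            "INSERT INTO item VALUES (?, ?, ?)", (server_id, local_id, body)
        )

keys = merged.execute(
    "SELECT server_id, local_id FROM item ORDER BY server_id"
).fetchall()
```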

Resources