Using hidden and open keys for data tables - database

Not a coding question per se; I'm just looking for best-practice suggestions.
I believe it is best practice to have a key for every data table. I've read recently that a key should NOT be an integer (auto number, etc.) because it makes hacking a slight bit easier.
So, I've started to use Guid for keys.
Now, should I use a human-readable ID as well?
For example, in a table of departments in an organization (Human Resources, Information Technology, etc.), should I also include a column for a short ID such as HR, IT, etc.?
What are your thoughts?
PS. I realize this is a simple example for discussion's sake. Thanks.
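To make the layout being asked about concrete, here is a minimal sketch, assuming SQLite through Python's sqlite3 module and a made-up table `departments` with columns `id`, `code` and `name` (none of these names come from the question). The GUID is the hidden key and the short code is the open, human-readable one, kept unique by a constraint.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")

# Hypothetical schema: 'id' is the hidden GUID key, 'code' the open key.
conn.execute(
    """
    CREATE TABLE departments (
        id   TEXT PRIMARY KEY,      -- GUID stored as text
        code TEXT NOT NULL UNIQUE,  -- short human-readable ID, e.g. 'HR'
        name TEXT NOT NULL
    )
    """
)


def add_department(code, name):
    """Insert a department with a freshly generated GUID and return that GUID."""
    dept_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO departments (id, code, name) VALUES (?, ?, ?)",
        (dept_id, code, name),
    )
    return dept_id


hr_id = add_department("HR", "Human Resources")
it_id = add_department("IT", "Information Technology")

# People look rows up by the readable code; other tables would reference the GUID.
row = conn.execute(
    "SELECT id, name FROM departments WHERE code = ?", ("HR",)
).fetchone()
```

Whether the extra column is worth it is still the open question; the sketch only shows that both keys can live side by side.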

Related

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide which option would be better, I asked my Databases teacher; unfortunately, he couldn't tell me which of the two options is better or why.
I designed the database for a dummy data generator. Since I want to generate multilingual data, I thought of these tables (this is a simplification of the tables):
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language; sometimes a name can have multiple origins (e.g. Nick is both a Dutch and an English name).
Each language has multiple names and streets.
These two rules result in a Many-to-Many relationship. At the moment I've got only two tables, but I know I will get between 10 and 20 of these kinds of tables.
The regular way to do this is to make 10 to 20 Many-to-Many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create enough good, realistic-looking working data for projects. This application will generate this data for you and return the needed SQL so you only have to run the queries.
The user comes to the site to get the data. He states his table name and his column names, and then he can link the column names to types of data, for example:
* Firstname
* Lastname
* Email address (which will be randomly generated from the name of the person)
* Address details (street, house number, zip code, place, country)
* A lot more
Then, after linking columns with the types the user can set the number of rows he wants to make. The application will then choose a country at random and generate realistic looking data according to the country they live in.
That's actually an excellent question. This sort of thing leads to a genuine problem in database design and there is a real tradeoff. I don't know what rdbms you are using but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with one column per potential table, plus check constraints ensuring that only one fkey besides language is filled in. Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features in your rdbms for a solution. For example in postgresql this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
Unfortunately there is no single definitive answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I were just working with one RDBMS, I would do the last one. But if not, I would probably do one table per relationship and focus on tooling to manage the problems that come up. The former preference is about my level of knowledge and confidence, and the latter is a bit more of a personal opinion.
So I hope this helps you look at the tradeoffs and select what is right for you.
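As a rough illustration of the first choice (one M-M table, one nullable fkey column per target table, and a check constraint), here is a sketch using SQLite through Python's sqlite3 module. The table name `language_links`, the helper `try_insert` and the sample street name are assumptions made up for this example; the CHECK syntax relies on SQLite treating boolean expressions as 0/1 and may need rewording in other RDBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces fkeys when asked
conn.executescript(
    """
    CREATE TABLE languages (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE names     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE streets   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Hypothetical single M-M table: one nullable fkey column per target table.
    CREATE TABLE language_links (
        language_id INTEGER NOT NULL REFERENCES languages(id),
        name_id     INTEGER REFERENCES names(id),
        street_id   INTEGER REFERENCES streets(id),
        -- exactly one of the target columns must be filled in
        CHECK ((name_id IS NOT NULL) + (street_id IS NOT NULL) = 1)
    );
    """
)

conn.execute("INSERT INTO languages VALUES (1, 'Dutch'), (2, 'English')")
conn.execute("INSERT INTO names VALUES (1, 'Nick')")
conn.execute("INSERT INTO streets VALUES (1, 'Dorpsstraat')")  # sample data only


def try_insert(language_id, name_id, street_id):
    """Return True if the link row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO language_links VALUES (?, ?, ?)",
            (language_id, name_id, street_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Nick is both a Dutch and an English name.
try_insert(1, 1, None)
try_insert(2, 1, None)

origins = [
    r[0]
    for r in conn.execute(
        "SELECT l.name FROM language_links ll "
        "JOIN languages l ON l.id = ll.language_id "
        "WHERE ll.name_id = 1 ORDER BY l.name"
    )
]
```

Referential integrity is kept here, but every new target table means another column and a longer CHECK, which is the "Ick" part.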

Reverse index implementation with spring-data and postgres

I have a table, say orders, which has a column, say an alphanumeric 15-character itemId, and a bunch of other columns. The same itemId can be repeated up to, say, 900 times for very popular items, which means the data will be repeated about 900 times. Obviously, we need to separate it out. However, we need the lookup for the list of items to be very quick and efficient. I read up a bit and thought reverse indexing would be a good way to achieve this. However, I am a bit confused about the actual implementation. I couldn't find any examples online either, other than http://blog.frankel.ch/tag/spring-data , but it uses solr. I was thinking of creating an items-orders table and adding a repository class which will have a method to . However, since there is a many-to-many relation between items and orders, it will require a join table. This makes me think that I am on the wrong path, as I intended the items-orders table itself as a kind of join table, since it only has itemId and orderId in it.
I am pretty sure I am doing something wrong. Any pointers are greatly appreciated. Sorry for a basic question, but I could not find much information with samples online.
thanks,
Alice
You're on the right track with an item-orders link table. You will probably find that you end up using the table for additional columns you haven't considered yet (quantity, price, etc.).
The main thing to start with is making sure your database design is right; look up the basic normalization rules about not duplicating information. Also, when you create your tables, make sure you explicitly tell the database about the relationships between the tables using FOREIGN KEY and PRIMARY KEY constraints.
Once you have the correct logical structure in place you can see if you have any performance issues that require you to do anything clever.
Relational databases were designed to do exactly what you're contemplating, though, so the performance will probably be much better than you fear. Premature optimization would be a huge mistake.
You mentioned solr; this is a generic text search engine (sort of like Google). For your requirements you want to stick to a pure relational database. You want a database that delivers exact results based on exact criteria (exactly which products have been included in an order, etc.); you don't want any fuzzy matching or artificial intelligence guessing about what has been ordered.
You might also store the product catalogue in solr so the user could look for products that mention pink, blue or purple in the description and come in a size 4, etc.; then, once the product has been chosen, use the itemId in the relational database.
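A minimal sketch of such a link table, assuming SQLite via Python's sqlite3 rather than the asker's Postgres and spring-data setup. The table and column names (`items`, `orders`, `order_items`, `quantity`), the index name, the sample item IDs and both helper functions are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE items  (item_id  TEXT PRIMARY KEY, description TEXT NOT NULL);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL);

    -- The link table: one row per (order, item) pair, plus room for extras.
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        item_id  TEXT    NOT NULL REFERENCES items(item_id),
        quantity INTEGER NOT NULL DEFAULT 1,
        PRIMARY KEY (order_id, item_id)
    );

    -- Ordinary index for the reverse lookup "which orders contain this item?"
    CREATE INDEX idx_order_items_item ON order_items(item_id);
    """
)

POPULAR = "ABC123DEF456GHI"  # made-up 15-character item IDs
RARE = "ZZZ999YYY888XXX"

conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(POPULAR, "Popular widget"), (RARE, "Rare widget")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(n, f"customer {n}") for n in range(1, 901)],
)
# The popular item appears in 900 orders, but its description is stored once.
conn.executemany(
    "INSERT INTO order_items (order_id, item_id) VALUES (?, ?)",
    [(n, POPULAR) for n in range(1, 901)],
)
conn.execute(
    "INSERT INTO order_items (order_id, item_id, quantity) VALUES (1, ?, 3)",
    (RARE,),
)


def orders_for_item(item_id):
    """Order IDs containing the given item, served by idx_order_items_item."""
    rows = conn.execute(
        "SELECT order_id FROM order_items WHERE item_id = ? ORDER BY order_id",
        (item_id,),
    )
    return [r[0] for r in rows]


def items_for_order(order_id):
    """(item_id, quantity) pairs for one order."""
    rows = conn.execute(
        "SELECT item_id, quantity FROM order_items "
        "WHERE order_id = ? ORDER BY item_id",
        (order_id,),
    )
    return rows.fetchall()
```

The plain index on `item_id` plays the role the asker had in mind for a reverse index; whether it is fast enough is something to measure once the logical structure is in place.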

Auto-Complete/Primary Key as String - PostgreSQL

I set up a database that is not too complex but nonetheless has multiple many-to-many relationships. Let me first briefly explain the database using three tables (there are many more, but just to keep things simple):
The database stores information about completed projects. One attribute is the software used. So I have three tables (with respective columns/keys):
tblProjects(ProjectID[PK], ProjectTitle, etc...)
tblProjectsSoftware(SoftwareID[FK], ProjectID[FK], UniqueID[PK])
tblSoftwareUsed(SoftwareID[PK], SoftwareName)
In order to make data entry easier in phppgadmin, I was considering just making 'SoftwareName' the primary key in tblSoftwareUsed. This is because when I go to enter the software associated with certain projects into tblProjectsSoftware, I can only use the auto-complete feature on the SoftwareID column which is just more or less a meaningless number.
As you can see, when entering data into the SoftwareID column of tblSoftwareUsed, I would only be able to 'filter' results by the ID and not the name. When this database gets large, it may not be an issue for software, but there are some other attributes that will have tons of records. To explain that further, I would start my data entry by creating a record for the project in tblProjects. Then I would create new records (if necessary) for software used. Then, when entering data into tblProjectsSoftware, I would either have to know the ID of the software or click through a few pages to find it.
So, my question is: would I have any issues if I made the name of the software my Primary Key, or would it be better to just leave it as is, with the ID as the PK? Furthermore, maybe I am missing an option to make 'SoftwareName' searchable in addition to the ID.
There are advantages and disadvantages to using surrogate keys, which are discussed at length in this wikipedia article:
http://en.wikipedia.org/wiki/Surrogate_key
Borrowing their headers...
Advantages:
Immutability
Requirement changes
Performance
Compatibility
Uniformity
Validation
Disadvantages:
Disassociation
Query optimization
Normalization
Business process modeling
Inadvertent disclosure
Inadvertent assumptions
More often than not, you'll want to use a surrogate key for practical reasons -- such as avoiding headaches when you need to update a software name.
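The renaming headache can be sketched like this, assuming SQLite via Python's sqlite3 instead of PostgreSQL. The table and column names follow the question; the UNIQUE constraint on SoftwareName, the sample data and the helper functions are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE tblProjects (
        ProjectID    INTEGER PRIMARY KEY,
        ProjectTitle TEXT NOT NULL
    );
    CREATE TABLE tblSoftwareUsed (
        SoftwareID   INTEGER PRIMARY KEY,   -- surrogate key
        SoftwareName TEXT NOT NULL UNIQUE   -- natural key, still enforced and searchable
    );
    CREATE TABLE tblProjectsSoftware (
        UniqueID   INTEGER PRIMARY KEY,
        ProjectID  INTEGER NOT NULL REFERENCES tblProjects(ProjectID),
        SoftwareID INTEGER NOT NULL REFERENCES tblSoftwareUsed(SoftwareID)
    );
    """
)

conn.execute("INSERT INTO tblProjects (ProjectTitle) VALUES ('Demo project')")
conn.execute("INSERT INTO tblSoftwareUsed (SoftwareName) VALUES ('EditorX')")


def software_id_for(name):
    """Look the surrogate ID up by name, so data entry never needs the number."""
    row = conn.execute(
        "SELECT SoftwareID FROM tblSoftwareUsed WHERE SoftwareName = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def link(project_id, software_name):
    """Attach software to a project by name; the ID is resolved behind the scenes."""
    conn.execute(
        "INSERT INTO tblProjectsSoftware (ProjectID, SoftwareID) VALUES (?, ?)",
        (project_id, software_id_for(software_name)),
    )


def software_for_project(project_id):
    rows = conn.execute(
        "SELECT s.SoftwareName FROM tblProjectsSoftware ps "
        "JOIN tblSoftwareUsed s ON s.SoftwareID = ps.SoftwareID "
        "WHERE ps.ProjectID = ?",
        (project_id,),
    )
    return [r[0] for r in rows]


link(1, "EditorX")

# Renaming touches exactly one row; the link table is left alone.
conn.execute(
    "UPDATE tblSoftwareUsed SET SoftwareName = 'EditorX Pro' "
    "WHERE SoftwareName = 'EditorX'"
)
```

With the name as the primary key, the same rename would have to cascade into every row of tblProjectsSoftware that references it.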

Junction Table with multiple primary keys

I started to learn Atk just a week ago and I decided to reimplement the business intranet which is becoming unmaintainable.
The model abstraction is very cool to use but I wonder how to specify multiple primary keys for my junction tables.
For example, I have sites and I want to assign machines on them for a period.
Junction table
It is very important to me not to touch the database.
I guess you can do that using joins in your Models if you're not allowed to add a simple auto_increment id field to that junction table.
But adding an id field will be much easier for you and will also work better in the future.
For example, you can run into issues deleting a record without a unique id field. I know (id_machine,id_site) is most likely unique in your case, but it's still hard to work with. One simple id field is easier, faster and better :)
Dark has a good suggestion. You can also create a calculated ID field. I have never tried it myself, but do try it:
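// Drop the model's default id field...
$model->getElement('id')->destroy();
// ...and replace it with an expression built from both key columns (untested)
$model->addExpression('id')->set('concat(id_machine,"-",id_site)');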

Database schema design

I'm quite new to database design and have some questions about best practices and would really like to learn.
I am designing a database schema. I have a good idea of the requirements, and now it's a matter of getting it into black and white.
In this pseudo-database-layout, I have a table of customers, table of orders and table of products.
TBL_PRODUCTS:
ID
Description
Details
TBL_CUSTOMER:
ID
Name
Address
TBL_ORDER:
ID
TBL_CUSTOMER.ID
prod1
prod2
prod3
etc
Each 'order' has only one customer, but can have any number of 'products'.
The problem is that, in my case, an order can contain any number of products (hundreds for a single order). On top of that, each product in an order needs more than just a 'quantity': it can have values that span pages of text for a specific product in a specific order.
My question is, how can I store that information?
Assuming I can't store a variable-length array as a single field value, the other option is to have a string that is delimited somehow and split by code in the application.
An order could have, say, 100 products, each product having either only a small int, or 5000 characters of free text (or anything in between), unique only to that order.
On top of that, each order must have its own audit trail, as many things can happen to it throughout its lifetime.
An audit trail would contain the usual information - user, time/date, action - and can be any length.
Would I store an audit trail for a specific order in its own table (as they could become quite lengthy), created as the order is created?
Are there any places where I could learn more about techniques for database design?
The most common way would be to store the order items in another table.
TBL_ORDER:
ID
TBL_CUSTOMER.ID
TBL_ORDER_ITEM:
ID
TBL_ORDER.ID
TBL_PRODUCTS.ID
Quantity
UniqueDetails
The same can apply to your Order audit trail. It can be a new table such as
TBL_ORDER_AUDIT:
ID
TBL_ORDER.ID
AuditDetails
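A runnable sketch of this layout, assuming SQLite via Python's sqlite3. The table names drop the TBL_ prefix and the dotted column names, and the audit columns (`audit_user`, `audited_at`, `action`), the sample data and the `add_item` helper are assumptions made up for the example, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    );
    -- One row per product on an order, however many there are.
    CREATE TABLE order_items (
        id             INTEGER PRIMARY KEY,
        order_id       INTEGER NOT NULL REFERENCES orders(id),
        product_id     INTEGER NOT NULL REFERENCES products(id),
        quantity       INTEGER NOT NULL,
        unique_details TEXT            -- free text specific to this order line
    );
    -- One shared audit table for all orders, one row per event.
    CREATE TABLE order_audit (
        id         INTEGER PRIMARY KEY,
        order_id   INTEGER NOT NULL REFERENCES orders(id),
        audit_user TEXT NOT NULL,
        audited_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        action     TEXT NOT NULL
    );
    """
)

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Sample customer')")
conn.executemany(
    "INSERT INTO products (id, description) VALUES (?, ?)",
    [(n, f"product {n}") for n in range(1, 101)],
)
conn.execute("INSERT INTO orders (id, customer_id) VALUES (1, 1)")


def add_item(order_id, product_id, quantity, details=None, user="demo_user"):
    """Add one order line and record the change in the audit trail."""
    conn.execute(
        "INSERT INTO order_items (order_id, product_id, quantity, unique_details) "
        "VALUES (?, ?, ?, ?)",
        (order_id, product_id, quantity, details),
    )
    conn.execute(
        "INSERT INTO order_audit (order_id, audit_user, action) VALUES (?, ?, ?)",
        (order_id, user, f"added product {product_id}"),
    )


# An order with 100 products, one of them carrying a long free-text value.
for product_id in range(1, 101):
    add_item(1, product_id, 1, "x" * 5000 if product_id == 1 else None)

item_count = conn.execute(
    "SELECT COUNT(*) FROM order_items WHERE order_id = 1"
).fetchone()[0]
audit_count = conn.execute(
    "SELECT COUNT(*) FROM order_audit WHERE order_id = 1"
).fetchone()[0]
```

No delimited strings or per-order tables are needed: the number of rows grows with the order, while the number of tables stays fixed.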
First of all, Google Third Normal Form. In most cases, your tables should be in 3NF, but there are exceptions made for performance or ease of use, and only experience can really teach you when.
What you have is not normalized. You need a "Join table" to implement the many to many relationship.
TBL_ORDER:
ID
TBL_CUSTOMER.ID
TBL_ORDER_PRODUCT_JOIN:
ID
TBL_ORDER.ID
TBL_PRODUCTS.ID
Quantity
TBL_ORDER_AUDIT:
ID
TBL_ORDER.ID
Audit_Details
The basic conventional name for the ID column in the Orders table (plural, because ORDER is a keyword in SQL) is "Order Number", with the exact spelling varying (OrderNum, OrderNumber, Order_Num, OrderNo, ...).
The TBL_ prefix is superfluous; it is doubly superfluous since it doesn't always mean table, as for example in the TBL_CUSTOMER.ID column name used in the TBL_ORDER table. Also, it is a bad idea, in general, to try using a "." in the middle of a column name; you would have to always treat that name as a delimited identifier, enclosing it in either double quotes (standard SQL and most DBMS) or square brackets (MS SQL Server; not sure about Sybase).
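Both naming problems are easy to reproduce. This is a small sketch against SQLite via Python's sqlite3 (other DBMSs report the error differently); the helper `can_create` and the table names `t1` and `t2` are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def can_create(sql):
    """Return True if the CREATE TABLE statement is accepted, False on a syntax error."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.OperationalError:
        return False


# ORDER is a keyword, so it only works as a table name when delimited.
unquoted_keyword = can_create("CREATE TABLE order (id INTEGER PRIMARY KEY)")
quoted_keyword = can_create('CREATE TABLE "order" (id INTEGER PRIMARY KEY)')

# A "." inside a column name likewise has to be delimited every time it is used.
unquoted_dot = can_create("CREATE TABLE t1 (TBL_CUSTOMER.ID INTEGER)")
quoted_dot = can_create('CREATE TABLE t2 ("TBL_CUSTOMER.ID" INTEGER)')

conn.execute('INSERT INTO t2 ("TBL_CUSTOMER.ID") VALUES (7)')
value = conn.execute('SELECT "TBL_CUSTOMER.ID" FROM t2').fetchone()[0]
```

Names like orders and customer_id avoid the quoting entirely.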
Joe Celko has a lot to say about things like column naming. I don't agree with all he says, but it is readily searchable. See also Fabian Pascal 'Practical Issues in Database Management'.
The other answers have suggested that you need an 'Order Items' table - they're right; you do. The answers have also talked about storing the quantity in there. Don't forget that you'll need more than just the quantity. For example, you'll need the price prevailing at the time of the order. In many systems, you might also need to deal with discounts, taxes, and other details. And if it is a complex item (like an airplane), there may be only one 'item' on the order, but there will be an enormous number of subordinate details to be recorded.
While not a reference on how to design database schemas, I often use the schema library at DatabaseAnswers.org. It is a good jumping off location if you want to have something that is already roughed in. They aren't perfect and will most likely need to be modified to fit your needs, but there are more than 500 of them in there.
Learn Entity-Relationship (ER) modeling for database requirements analysis.
Learn relational database design and some relational data modeling for the overall logical design of tables. Data normalization is an important part of this piece, but by no means all there is to learn. Relational database design is pretty much DBMS independent within the main stream DBMS products.
Learn physical database design. Learn index design as the first stage of designing for performance. Some index design is DBMS independent, but physical design becomes increasingly dependent on special features of your DBMS as you get more detailed. This can require a book that's specifically tailored to the DBMS you intend to use.
You don't have to do all the above learning before you ever design and build your first database. But what you don't know WILL hurt you. Like any other skill, the more you do it, the better you'll get. And learning what other people already know is a lot cheaper than learning by trial and error.
Take a look at Agile Web Development with Rails, it's got an excellent section on ActiveRecord (an implementation of the same-named design pattern in Rails) and does a really good job of explaining these types of relationships, even if you never use Rails. Here's a good online tutorial as well.

Resources