Database design for a product aggregator

I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, color, etc. Users need to be able to search and filter results based on any of those product attributes. I also expect to have a large number of users. My initial thought was to have one big table with every product in it, with a column for each piece of information and an index on anything I need to be able to search by, but I think this might be inefficient with a lot of users pounding on that one table. My other thought was to organize the database to promote tree-like navigation of tables, but because you can search by anything I'm not sure how I would organize the tables.
Any thoughts on some good practices?

One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at all the data you have and determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of the data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All of this will inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables; others (price) would be simple scalars. With appropriate indexing you'll be fine. For advanced analytics, consider a dimensionally modeled star schema, though perhaps not for your live transactional system; it depends on what your data flow/workflow/transactions are. Or consider adopting some of its principles in your transactional database. Ralph Kimball is a good source of information on dimensional modeling.

I don't see any need for a tree structure here. You can do this with a single table.
If you insist on a tree structure with a hierarchy, here is an example to get you started.
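A minimal sketch, assuming an adjacency-list design (table and column names are only placeholders, not from the question):

-- Adjacency list: each category row points at its parent; the root has NULL.
CREATE TABLE category (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES category (id),
    name      VARCHAR(100) NOT NULL
);

-- Products hang off whichever category they belong to.
CREATE TABLE product (
    id          INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL REFERENCES category (id),
    name        VARCHAR(200) NOT NULL,
    price       DECIMAL(10, 2) NOT NULL,
    color       VARCHAR(30)
);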

For text-based search, and for ease of startup and design, I strongly recommend Apache Solr. The Solr API is easy to use (especially with JSON). Databases do text search poorly; instead, I would just make sure that the database responds to primary/unique-key queries efficiently, and those are the fields you should index.

One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index the top 5 or 10 columns you think users are most likely to search on, unless it's possible for a user to search on ANY column, in which case you might want to look at building a data warehouse. Maybe look into data cubes to see if those will help?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID   PARENT_ID   NAME
1    NULL        ROOT
2    1           SOCKS
3    1           HELICOPTER PARTS
4    2           ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.
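As a sketch of such a traversal over the table above (Postgres-style syntax; Oracle can omit the RECURSIVE keyword or use CONNECT BY instead):

-- Walk the tree from the root downwards, keeping track of the depth.
WITH RECURSIVE category_tree (id, parent_id, name, depth) AS (
    SELECT id, parent_id, name, 0
    FROM   product_category
    WHERE  parent_id IS NULL          -- the root row
    UNION ALL
    SELECT c.id, c.parent_id, c.name, t.depth + 1
    FROM   product_category c
    JOIN   category_tree t ON c.parent_id = t.id
)
SELECT id, parent_id, name, depth
FROM   category_tree
ORDER  BY depth, name;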

Related

I'm unable to normalize my Product table as I have 4 different product types

So, because I have 4 different product types (books, magazines, gifts, food), I can't just put all products in one "products" table without having a bunch of null values. So I decided to break each product type out into its own table, but I know this is just wrong (https://c1.staticflickr.com/1/742/23126857873_438655b10f_b.jpg).
I also tried creating an EAV model for this (https://c2.staticflickr.com/6/5734/23479108770_8ae693053a_b.jpg), but I got stuck as I'm not sure how to link the publishers and authors tables.
I know this question has been asked a lot, but I don't understand ANY of the answers I've seen. I think this is because I'm a very visual learner, which makes it hard to understand what's being talked about when not a lot of information is given.
Your model is on the right track, except that a single product name should be sufficient; you don't need Gift name, Book name, etc. What you put in those tables is the information specific to that type of product that the other products don't need. The Product table contains all the common fields. I would use productid in the child tables rather than renaming it giftID, magazineID, etc. It is easier to remember what things are called when you name them consistently.
Now, to be practical, you put as much as you can into the Product table, especially anything you are going to do calculations on. I prefer the child tables in this specific case to hold what is mostly display information. So Product contains the product name, the cost, the type of product, the units the product is sold in, etc.: the stuff that is generally needed to calculate the cost of an order or to report on what was ordered. There may be one or two fields that can contain nulls, but it simplifies the calculation-type queries so much that it might be worth it.
The meat of the descriptive details though would go in the child table for the type of product. These would usually only be referenced when displaying the product in the shopping area and only one at a time, so you can use the product type to let you only join to the one child table you need for display. So while the order cares about the product number and name and cost calculations, it probably doesn't need to go line by line describing the book ISBN number or the megapixels in a camera. But the description page of the product does need those things.
This approach is not purely relational, although it mostly is, but it groups the information by the meaning of the data and how it will be used, which will make the database easier to understand and query. I am a big fan of relational tables because databases just work better when they hit at least third normal form, but sometimes you can go too far for practicality, so the meaning of the data and the way you are grouping it for use (and not just for the user interface, but for later reporting as well) is almost always one of my considerations in design.
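To make that concrete, here is a rough sketch under the assumptions above (column names are illustrative; publisher and author are flattened to plain columns here for brevity, but they could just as well be foreign keys to their own tables, as in your EAV attempt):

-- Common fields live on Product; the child table reuses productid as its PK and FK.
CREATE TABLE Product (
    productid   INTEGER PRIMARY KEY,
    productname VARCHAR(200)   NOT NULL,
    producttype VARCHAR(20)    NOT NULL,   -- 'book', 'magazine', 'gift', 'food'
    price       DECIMAL(10, 2) NOT NULL,
    unit        VARCHAR(20)
);

CREATE TABLE Book (
    productid INTEGER PRIMARY KEY REFERENCES Product (productid),
    isbn      VARCHAR(20) NOT NULL,
    publisher VARCHAR(100),
    author    VARCHAR(100)
);

-- The product description page joins only to the child table its type points at:
SELECT p.productname, p.price, b.isbn, b.author
FROM   Product p
JOIN   Book b ON b.productid = p.productid
WHERE  p.productid = 42;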
Breaking each product type into its own table is fine: let the child tables use the same id as the parent Product table, and create views for the child tables that join with Product.
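For example, reusing the sketch above, one view per product type keeps the join out of your application queries:

-- Presents the combined parent/child row as if it were a single table.
CREATE VIEW BookProducts AS
SELECT p.productid, p.productname, p.price, b.isbn, b.publisher, b.author
FROM   Product p
JOIN   Book b ON b.productid = p.productid;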
Your case is a classic case of types and subtypes. This is often called class/subclass in object modeling and generalization/specialization in ER modeling. It's a well understood pattern. There are known techniques for dealing with this pattern.
Visit the following tags, and read the description under each tag's info tab (presented as "learn more"). Also look over the questions grouped under these tags.
single-table-inheritance class-table-inheritance shared-primary-key
If you want to read in more depth, use these buzzwords to search for articles on the web.
You've already discovered and discarded single table inheritance on your own. Other answers have pointed you at shared primary key. Class table inheritance involves a single table for generalized data as well as the four specialized tables. Shared primary key is generally used in conjunction with class table inheritance.

Reverse index implementation with spring-data and postgres

I have a table, say orders, which has a column, say an alphanumeric, 15-character-long itemId, and a bunch of other columns. The same itemId can be repeated up to, say, 900 times for very popular items, which means the data will be repeated about 900 times. Obviously, we need to separate it out. However, we need the lookup for the list of items to be very quick and efficient. I read up a bit and thought reverse indexing would be a good way to achieve this. However, I am a bit confused about the actual implementation. I couldn't find any examples online either, other than http://blog.frankel.ch/tag/spring-data, but it uses Solr. I was thinking of creating an items-orders table and adding a repository class which will have a method to …. However, since there is a many-to-many relation between items and orders, it will require a join table. This makes me think that I am on the wrong path, as I intended the items-orders table itself to be a kind of join table, since it only has itemId and orderId in it.
I am pretty sure I am doing something wrong. Any pointers are greatly appreciated. Sorry for a basic question, but I could not find much information with samples online.
thanks,
Alice
You're on the right track with an item orders link table. You will probably find that you end up using the table for additional columns you haven't considered yet (quantity, price, etc.)
The main thing to do to start with is make sure your database design is right: look up the basic normalization rules about not duplicating information. Also, when you create your tables, make sure you're explicitly telling the database about the relationships between them using FOREIGN KEY and PRIMARY KEY constraints.
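A minimal sketch of what that might look like (the table and extra column names here are assumptions, not something from your schema):

CREATE TABLE items (
    item_id CHAR(15)     PRIMARY KEY,   -- the 15-character alphanumeric id
    name    VARCHAR(200) NOT NULL
);

CREATE TABLE orders (
    order_id  BIGINT    PRIMARY KEY,
    placed_at TIMESTAMP NOT NULL
);

-- The link table: one row per (item, order) pair, with room for extra columns later.
CREATE TABLE item_orders (
    item_id  CHAR(15) NOT NULL REFERENCES items (item_id),
    order_id BIGINT   NOT NULL REFERENCES orders (order_id),
    quantity INTEGER  NOT NULL DEFAULT 1,
    PRIMARY KEY (item_id, order_id)
);

-- "All orders containing item X" becomes an indexed lookup on the link table:
SELECT o.*
FROM   orders o
JOIN   item_orders io ON io.order_id = o.order_id
WHERE  io.item_id = 'ABC123XYZ789012';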
Once you have the correct logical structure in place you can see if you have any performance issues that require you to do anything clever.
Relational databases were designed to do exactly what you're contemplating, though, so the performance will probably be much better than you fear. Premature optimization would be a huge mistake.
You mentioned Solr; this is a generic text search engine (sort of like Google). For your requirements you want to stick to a pure relational database. You want a database that delivers exact results based on exact criteria: exactly which products have been included in an order, and so on. You don't want any fuzzy matching or artificial intelligence guessing about what has been ordered.
You might also store the product catalogue in Solr, so the user could look for products that mention pink, blue or purple in the description and come in a size 4, etc.; then, once a product has been chosen, use its itemId in the relational database.

Database Design to handle newsfeed for different activities

I am going to create a new project where I need users to view their friends' activities and actions, just like Facebook and LinkedIn.
Each user can perform 5 different types of activities, and each activity has different attributes; for example, activity X can be public/private, while activity Y is assigned to categories. Some actions involve 1 user, others 2 or 3, etc. Eventually I have to aggregate all 5 different types of activities on the newsfeed page.
How can I design a database that is efficient?
I have 3 designs in mind, please let me know your thoughts. Any new ideas will be greatly appreciated!
1- Separate tables: since there are nearly 3-4 different columns for each activity, it would be logical to separate each activity into its own table.
Pros: Clean database, and easy to develop.
Cons: I will need to query the database 5 times and aggregate the results to build a single newsfeed page.
2- One big table: This table will hold all activities, with many unused columns. A new numeric column called "type" will indicate the type of activity. Some attributes could be combined in an HStore field (since we are using Postgres); others will be queried a lot, so I don't think it is a good idea to include those in an HStore field. (A rough sketch of this option appears after this list.)
Pros: Easy to pull newsfeed.
Cons: Lots of reads/writes on the same table; the code will be a bit messier, and so is the database.
3- Hybrid: A solution would be to make one table containing all the newsfeed, with a polymorphic association to other tables that contain details of each specific activity.
Pros: Tidy code and database, easy to add new activities.
Cons: JOIN ALL THE TABLES to make a single newsfeed! Still better than making 5 different queries.
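Here is a rough Postgres sketch of what I mean by option 2 (the column names are made up for illustration):

CREATE EXTENSION IF NOT EXISTS hstore;

-- One row per activity; frequently queried attributes get real columns,
-- the rarely queried, type-specific leftovers go into the hstore column.
CREATE TABLE activities (
    id          BIGSERIAL   PRIMARY KEY,
    user_id     BIGINT      NOT NULL,
    type        SMALLINT    NOT NULL,          -- 1..5, one value per activity kind
    is_public   BOOLEAN,                       -- only used by activity X
    category_id BIGINT,                        -- only used by activity Y
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    extra       HSTORE
);

CREATE INDEX activities_feed_idx ON activities (user_id, created_at DESC);

-- The newsfeed page is then one query over the friends' ids:
SELECT *
FROM   activities
WHERE  user_id IN (101, 102, 103)              -- the viewer's friend ids
ORDER  BY created_at DESC
LIMIT  20;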
As I am writing this post I am starting to lean towards solution number 2. Please advise!
Thanks
I would consider a graph database for this, such as Neo4j. It allows very flexible attributes on either nodes (users) or relationships (types of relations).
For small data sets and few joins, SQL databases are faster and more appropriate. But if your starting point is a 5-table join, graph databases seem simpler and offer similar performance (if not better).

Are there any standards/best-practices for managing small non-transactional lookup tables?

I have an ERP application with about 50 small lookup tables containing non-transactional data. Examples are ItemTypes, SalesOrderStatuses, etc. There are so many different types and categories and statuses, and with every new module new lookup tables are added. I have a service that provides List objects out of these tables. These tables usually contain only two columns (Id and Description) and only a couple of rows, 8-10 at most.
I am thinking about putting all of them in one table with ID, Description and LookupTypeID. With this one table I will be able to get rid of 50 tables. Is it a good idea? A bad idea? A very bad idea?
Are there any standards/best-practices for managing small lookup tables?
Among some professionals, the single common lookup table is considered a design error you should avoid. At the very least, it will slow down performance: you will need a compound primary key for the common table, and lookups via a compound key take longer than lookups via a simple key.
According to Anith Sen, this is the first of five design errors you should avoid. See this article: Five Simple Design Errors
Merging lookup tables is a bad idea if you care about integrity of your data (and you should!):
It would allow "client" tables to reference the data they were not meant to reference. E.g. the DBMS will not protect you from referencing SalesOrderStatuses where only ItemTypes should be allowed - they are now in the same table and you cannot (easily) separate the corresponding FKs.
It would force all lookup data to share the same columns and types.
Unless you have performance problems due to excessive JOINs, I recommend you stay with your current design.
If you do, then you could consider using natural instead of surrogate keys in the lookup tables. This way, the natural key gets "propagated" through foreign keys to the "client" tables, resulting in less need for JOINing, at the price of increased storage space. For example, instead of having ItemTypes {Id PK, Description AK}, have only ItemTypes {Description PK}, and you no longer have to JOIN with ItemTypes just to get the Description; it was automatically propagated down the FK.
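A small illustration of the difference, borrowing ItemTypes from the question (the client table name is invented):

-- Surrogate key: every query that needs the description has to JOIN.
CREATE TABLE ItemTypes (
    Id          INTEGER     PRIMARY KEY,
    Description VARCHAR(50) NOT NULL UNIQUE
);

-- Natural key: the description is the key, so it travels into the client table.
CREATE TABLE ItemTypesNatural (
    Description VARCHAR(50) PRIMARY KEY
);

CREATE TABLE Items (
    ItemId   INTEGER     PRIMARY KEY,
    ItemType VARCHAR(50) NOT NULL REFERENCES ItemTypesNatural (Description)
    -- selecting ItemType already gives the readable value; no JOIN required
);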
You can store them in a text-search (i.e. NoSQL) database like Lucene. They are ridiculously fast.
I have implemented this to great effect. Note, though, that there is some initial setup to overcome, but not much. Lucene queries on ids are a snap to write.
The "one big lookup table" approach has the problem of allowing for silly values -- for example "color: yellow" for trucks in the inventory when you only have cars with "color: yellow". One Big Lookup Table: Just Say No.
Off-hand, I would go with the natural keys for the lookup tables unless you would have cases like "the 2012 model CX300R was red but the 2010-2011 models CX300R were blue (and model ID also denotes color)".
Traditionally, if you ask a DBA they will say you should have separate tables. If you ask a programmer they will say the single table is easier. (It makes building an Edit Status webpage very easy: you just make one page and pass it a different LookupTypeID, instead of building lots of similar pages.)
However, now with ORMs, the SQL and code to access different status tables is not really any extra effort.
I have used both methods and both work fine. I must admit using a single status table is easiest. I have done this for small apps and also for enterprise apps and have noticed no performance impact.
Finally, the other field I normally like to add to these generic status tables is an OrderBy field, so you can sort the statuses in your UI by something other than the description if needed.
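For what it's worth, the generic table I have in mind looks roughly like this (the names are only an example):

CREATE TABLE Lookup (
    Id           INTEGER      NOT NULL,
    LookupTypeId INTEGER      NOT NULL,    -- e.g. 1 = ItemTypes, 2 = SalesOrderStatuses
    Description  VARCHAR(100) NOT NULL,
    OrderBy      INTEGER      NOT NULL DEFAULT 0,
    PRIMARY KEY (LookupTypeId, Id)
);

-- The single Edit Status page just filters on the type and sorts by OrderBy:
SELECT Id, Description
FROM   Lookup
WHERE  LookupTypeId = 2
ORDER  BY OrderBy, Description;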
Sounds like a good idea to me. You can have the ID and LookupTypeID as a multi-attribute primary key. You just need to know what all of the different LookupTypeIDs represent and you should be good as gold.
EDIT: As for the standards/best-practices, I honestly don't have an answer for you. I've only had one semester of SQL/database design so I haven't been all too exposed to the matter.

Do 1 to 1 relations on db tables smell?

I have a table that has a bunch of fields. The fields can be broken into logical groups - like a job's project manager info. The groupings themselves aren't really entity candidates as they don't and shouldn't have their own PKs.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Is there anything I should watch out for when I do this? Is this just a poor choice?
I can see that maybe my queries will get more complicated with all the extra joins, but that can be mitigated with views, right? If we're talking about a table with fewer than 100k records, is this going to have a noticeable effect on performance?
Edit: I'll justify the non-entity-candidate thought a little further. This information is entered by our user base. They don't know/care about each other, so it's possible that the same user will submit the same "project manager name" (or whatever), which, at this point, wouldn't be violating any constraint. It's for us to determine later down the pipeline whether we want to correlate entries from separate users. If I were to give these things their own key, they would grow at the same rate the main table grows, since they are essentially part of the same entity. At no point is a user picking from a list of available "project managers".
So, given the above, I don't think they are entities. But maybe not - if you have further thoughts please post.
I don't usually use 1 to 1 relations unless there is a specific performance reason for it. For example, storing an infrequently used large text or BLOB-type field in a separate table.
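For example, a split along those lines might look like this (the names are invented):

-- Hot columns stay in the main table; the rarely read large field moves out 1:1.
CREATE TABLE document (
    document_id INTEGER      PRIMARY KEY,
    title       VARCHAR(200) NOT NULL,
    created_at  TIMESTAMP    NOT NULL
);

CREATE TABLE document_body (
    document_id INTEGER PRIMARY KEY REFERENCES document (document_id),
    body        TEXT    NOT NULL
);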
I would suspect that there is something else going on here though. In the example you give - PmFirstName - it seems like maybe there should be a single pm_id relating to a "ProjectManagers" or "Employees" table. Are you sure none of those groupings are really entity candidates?
To me, they smell unless, for some rows or queries, you won't be interested in the extra columns: e.g. if for a large portion of your queries you are not selecting the PmFirstName columns, or if for a large subset of rows those columns are NULL.
I like the smells tag.
I use 1 to 1 relationships for inheritance-like constructs.
For example, all bonds have some basic information like CUSIP, Coupon, DatedDate, and MaturityDate. This all goes in the main table.
Now each type of bond (Treasury, Corporate, Muni, Agency, etc.) also has its own set of columns unique to it.
In the past we would just have one incredibly wide table with all that information. Now we break out the type-specific info into separate tables, which gives us much better performance.
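A stripped-down sketch of that layout (the type-specific columns shown here are illustrative, not our real schema):

-- Shared fields live once, in Bond; each bond type gets its own 1:1 extension table.
CREATE TABLE Bond (
    BondId       INTEGER       PRIMARY KEY,
    CUSIP        CHAR(9)       NOT NULL UNIQUE,
    Coupon       DECIMAL(9, 6) NOT NULL,
    DatedDate    DATE          NOT NULL,
    MaturityDate DATE          NOT NULL
);

CREATE TABLE MuniBond (
    BondId    INTEGER PRIMARY KEY REFERENCES Bond (BondId),
    State     CHAR(2),        -- example muni-specific columns
    TaxExempt BOOLEAN
);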
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Create a person table; every database needs one. Then, in your project table, have a column called PMKey which points to the person table.
Why do you feel that the groups of fields are not entity candidates? If they are not, then why try to identify them with a prefix?
Either drop the prefixes or extract them into their own table.
It is valuable splitting them up into separate tables if they are separate logical entities that could be used elsewhere.
So a "Project Manager" could be 1:1 with all the projects currently, but it makes sense that later you might want to be able to have a Project Manager have more than one project.
So having the extra table is good.
If you have PrimaryFirstName, PrimaryLastName, PrimaryPhone, SecondaryFirstName, SecondaryLastName, SecondaryPhone,
you could just have a "Person" table with FirstName, LastName, Phone.
Then your original table only needs "PrimaryId" and "SecondaryId" columns to replace the 6 columns you previously had.
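In other words, something like this (the Job table name is only for illustration):

CREATE TABLE Person (
    PersonId  INTEGER     PRIMARY KEY,
    FirstName VARCHAR(50) NOT NULL,
    LastName  VARCHAR(50) NOT NULL,
    Phone     VARCHAR(20)
);

-- Two foreign keys replace the six prefixed columns.
CREATE TABLE Job (
    JobId       INTEGER PRIMARY KEY,
    PrimaryId   INTEGER NOT NULL REFERENCES Person (PersonId),
    SecondaryId INTEGER REFERENCES Person (PersonId)
);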
Also, in SQL Server you can split filegroups and tables across physical locations.
So you could have a POST table and a COMMENT table that have a 1:1 relationship, but the COMMENT table is located on a different filegroup, on a different physical drive with more space.
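A SQL Server-flavoured sketch of that idea (the database, filegroup, and file path names are placeholders):

-- Create a filegroup backed by a file on another drive, then put COMMENT on it.
ALTER DATABASE MyDb ADD FILEGROUP CommentsFG;
ALTER DATABASE MyDb ADD FILE (
    NAME = 'CommentsData',
    FILENAME = 'D:\Data\CommentsData.ndf'
) TO FILEGROUP CommentsFG;

CREATE TABLE dbo.Post (
    PostId INTEGER PRIMARY KEY,            -- stays on the default filegroup
    Body   NVARCHAR(MAX) NOT NULL
);

CREATE TABLE dbo.Comment (
    PostId INTEGER PRIMARY KEY REFERENCES dbo.Post (PostId),  -- 1:1 with POST
    Body   NVARCHAR(MAX) NOT NULL
) ON CommentsFG;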
1:1 does not always smell. Unless it has no purpose.
