I have a dataset that I need to work with that represents a part schematic for a large machine. I need to come up with an appropriate database schema for this dataset and am having trouble coming up with something to use that represents this data efficiently.
The top level components are the biggest "structures", and as you traverse down the hierarchy, the data represents inner components, or components that make up the inner components. For example, at the top level, there could be an engine as a level 1 component, and then a level 2 component is a piston, which goes into an engine, and a level 3 component could be a gasket that goes into the piston.
This representation is spread across a few hundred lines of a CSV file. There are 3 columns for IDs:
a master_id, which all components have
a parent_id, which all components have as well, but its value varies based on the situation.
If the component in question is a level 1 part, the parent_id is its own master_id.
If the component in question is a level 2 part, the parent_id is the master_id of the level 1 component.
If the component in question is a level 3 part, the parent_id is the master_id of the level 2 component.
Basically, the parent id of any component is the master id of the component in the level above it. So lv1 parent is lv1 master (since it's the root), lv2 parent is lv1 master, and lv3 parent is lv2 master. Also, multiple components can share a parent ID, meaning multiple lv2 parts, for example, can have the same parent ID.
a grandparent_id, which only level 3 components have (but not all lv3 components for some reason (idk I didn't make this data set)). If a component is lv3 and has a grandparent_id, the grandparent ID is a direct link back to the master ID of the lv1 component. Yeah, confusing right?
So here's an example. A lv3 component has a master_id of 700000137, a parent_id of 600000049, and a grandparent_id of 500000006. If we look at the component with a master of 600000049, we'll see that this is a lv2 component that has a parent id of 500000006, which is the master id of a lv1 component, and again is the grandparent of this lv3 component.
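To make the relationship concrete, here is a minimal Python sketch, assuming the CSV has already been parsed into (master_id, parent_id) pairs. Only the three IDs come from the example above; the helper names `level` and `grandparent` are made up for illustration. It shows that both the level and the grandparent can be derived from parent_id alone, so the grandparent_id column carries no extra information.

```python
# Hypothetical rows: (master_id, parent_id), using the IDs from the example.
rows = [
    (500000006, 500000006),  # lv1: parent_id is its own master_id
    (600000049, 500000006),  # lv2: parent is the lv1 master
    (700000137, 600000049),  # lv3: parent is the lv2 master
]

parent_of = {master: parent for master, parent in rows}


def level(master_id):
    """Count hops up to the root; a root is its own parent."""
    depth = 1
    while parent_of[master_id] != master_id:
        master_id = parent_of[master_id]
        depth += 1
    return depth


def grandparent(master_id):
    """Return the parent's parent, or None for lv1 and lv2 parts."""
    if level(master_id) < 3:
        return None
    return parent_of[parent_of[master_id]]


print(level(700000137))        # 3
print(grandparent(700000137))  # 500000006
```

This also covers the lv3 rows whose grandparent_id is missing in the file, as long as their parent_id is present.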
I prefaced this post saying I need to come up with a database representation for this data set (it has later use in a project, but the data organization is the first step). I'm comfortable using PostgreSQL, so my initial thought was to make 3 tables, master, parent, and grandparent, where based on the key that I'm parsing out, I would insert the row into the appropriate table and foreign key back to the other tables if there were parent or grandparent keys. But I realized this could get quite hairy, especially since there could be multiple foreign keys linking back to a single master id, and I feel that with this representation some data could get repeated, which I obviously don't want happening.
My second thought was to use something like a python dictionary, where I essentially build out a tree like structure where the lv1 components are in the top level, the lv2 components in the second, etc. I could then convert the dictionary into JSON, since Python is nice that way, and store that json blob in the database. But, this JSON blob could potentially get REALLY big, though I guess that's just something I'd have to live with as the dataset grows. This part schematic I was given is only for one machine, so basically each entry in my database would be like
id | name | json
----------------------
1 | machine_a | JSON_BLOB_MACHINE_A
----------------------
2 | machine_b | JSON_BLOB_MACHINE_B
etc...
Does my second approach seem better than trying to create separate tables that represent each part level and foreign keying back to parents? If there's a better way to do this with Postgres, I'd appreciate you explaining it. Otherwise, I'm probably going to go with the latter route. Thanks!
If you don't need to join parts in other machines, then I think a jsonb column for parts may be best. You can still index jsonb using GIN indexes and get really good performance from queries.
As long as the parts are not shared among many machines, which would make updating part properties across all machines tricky, then you're probably OK.
This should make queries for a machine pretty effortless, as the majority of the data is self-contained.
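For reference, building the per-machine blob from the flat rows could look roughly like the sketch below. It assumes rows of (master_id, parent_id) where a root is its own parent, as described in the question. The function name `build_tree` and the "children" key are illustrative choices, and actually writing the result into a jsonb column is not shown.

```python
import json

# Hypothetical rows: (master_id, parent_id), using the IDs from the question.
rows = [
    (500000006, 500000006),
    (600000049, 500000006),
    (700000137, 600000049),
]


def build_tree(rows):
    """Nest each part under its parent; roots are their own parent."""
    nodes = {master: {"master_id": master, "children": []} for master, _ in rows}
    roots = []
    for master, parent in rows:
        if parent == master:
            roots.append(nodes[master])
        else:
            nodes[parent]["children"].append(nodes[master])
    return roots


tree = build_tree(rows)
blob = json.dumps(tree)  # text that could be stored as one machine's JSON value
print(blob)
```

Because every node is created before any linking happens, the rows can appear in the CSV in any order.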
Related
I have a question regarding the correct implementation of a Schema that I'm currently wrecking my head with:
We have machines, which consist of components, which consist of parts.
However, the relationships are as follows:
Machines (1) --> Components (N) - a machine is made up of various components
Components (N) --> Parts (N) - a component is made up of multiple parts, a part may be used in multiple components
Components (N) --> Components (N) - a component can also be made up of other components
Machines (N) --> Parts (N) - some parts may also be directly assigned to a machine
Furthermore, both parts and components that are flagged as needs_welding=1 will have a price associated with them. These prices will change over time.
I'm not quite sure as to how to model the following aspects:
How to relate the Parts directly to the machine table
How to model the parent/child relationship between the components
How to attach prices to the items (kinda reminds me of an SCD in a DWH, but I cannot seem to patch it together)
A good solution for N->N mappings is to create a dedicated mapping table. So, for example, to map a Component to the Part(s) it is made of, you can create a table called something like MapComponentToItsParts, which has two columns: the first contains the ID of the component, the second contains the ID of the part. Each should be a foreign key to its respective table. You can create similar tables such as MapComponentToSubComponent or MapMachineToPart.
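A minimal sketch of such a mapping table, using Python's built-in sqlite3 module as a stand-in for whatever engine is actually in use. The table name MapComponentToItsParts comes from the answer; the Component and Part column names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Component (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Part (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE MapComponentToItsParts (
    component_id INTEGER NOT NULL REFERENCES Component(id),
    part_id      INTEGER NOT NULL REFERENCES Part(id),
    PRIMARY KEY (component_id, part_id)
);
INSERT INTO Component VALUES (1, 'gearbox'), (2, 'pump');
INSERT INTO Part VALUES (10, 'bolt'), (11, 'seal');
INSERT INTO MapComponentToItsParts VALUES (1, 10), (1, 11), (2, 10);
""")

# Parts that make up one component.
parts_of_gearbox = [row[0] for row in conn.execute(
    "SELECT p.name FROM Part p "
    "JOIN MapComponentToItsParts m ON m.part_id = p.id "
    "WHERE m.component_id = ? ORDER BY p.name", (1,))]

# Components that use one part (the other direction of the N:N link).
components_using_bolt = [row[0] for row in conn.execute(
    "SELECT c.name FROM Component c "
    "JOIN MapComponentToItsParts m ON m.component_id = c.id "
    "WHERE m.part_id = ? ORDER BY c.name", (10,))]

print(parts_of_gearbox)       # ['bolt', 'seal']
print(components_using_bolt)  # ['gearbox', 'pump']
```

The composite primary key keeps the same component/part pair from being inserted twice.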
I am designing a database model for an application, and I have a table Post whose rows each belong to some category. OK, Category will logically be another table.
However, several categories belong to a super category (or domain, or area), and my question is this:
Should I create another table for super categories or domains, or build this hierarchy inside the Category table, with a key column that points to the parent?
I hope the problem is clear.
PS. I know that I can solve this problem with either approach, but are there any benefits to using the first over the second, or vice versa?
Thanks
It depends: if nearly every category has a parent, you could add a parent serial as a column. Then your category table will look like
+--+----+------+
|ID|Name|Parent|
+--+----+------+
The problem with this representation is that, as long as the hierarchy is not cyclic, some categories will have no parent. Furthermore, a category can only have one parent.
Therefore I would suggest using a category_hierarchy table. An additional table:
+-----+------+
|Child|Parent|
+-----+------+
The disadvantage of this approach is that nearly every category will be repeated. Therefore, if nearly all categories have parents, the redundancy will scale approximately with that number. If relations are quite sparse, however, one saves space. Furthermore, using an intelligent join will keep the second representation from taking long to execute. You can, for instance, define a view to handle such requests.
Furthermore there are situations where the second approach can improve speed. For instance if you don't need the hierarchy all the time (for instance when mapping serials to the category-name), lookups in the category table can be faster, simply because the table is more compact and thus more parts of the table will be cached.
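A rough sketch of the second representation, again using Python's sqlite3 module as a stand-in engine. The table name category_hierarchy and its Child/Parent columns follow the answer; the view name `category_with_parent` and the sample categories are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE category_hierarchy (
    child  INTEGER NOT NULL REFERENCES category(id),
    parent INTEGER NOT NULL REFERENCES category(id),
    PRIMARY KEY (child, parent)
);
-- View that hides the join, so callers see one row per category/parent pair.
CREATE VIEW category_with_parent AS
SELECT c.id, c.name, p.name AS parent_name
FROM category c
LEFT JOIN category_hierarchy h ON h.child = c.id
LEFT JOIN category p ON p.id = h.parent;

INSERT INTO category VALUES (1, 'science'), (2, 'physics'), (3, 'news');
INSERT INTO category_hierarchy VALUES (2, 1);
""")

rows = conn.execute(
    "SELECT name, parent_name FROM category_with_parent ORDER BY id"
).fetchall()
print(rows)  # [('science', None), ('physics', 'science'), ('news', None)]
```

Categories without a parent simply have no row in category_hierarchy, so the category table itself never needs a nullable column.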
I have an object structure in C# which I'm persisting to SQL Server 2008 in a pattern similar to what is described here. Basically I have some states, with different properties. There is a main State table which has an Id which is FK on the subtype tables, see attached image (there is a large number of states in the implementation).
Now, I'd like to get the properties of a given state (known Id). For instance, say Id 5 denotes an Active state, I'd like to get the values of prop1 and prop2 in the diagram.
The only way I can come up with is joining all the state tables (knowing that I will only get a match from one). Is there a better way of accomplishing this?
I am having trouble arriving at a normalized relational database design to describe a small hierarchy which deviates enough from the typical hierarchy examples such that I am unsure how to proceed my first time tackling such a problem.
My problem is as follows:
Each branch in the hierarchy is guaranteed to be either 2, 4, or 6 levels deep.
If it is 2 levels deep, the hierarchy looks like this:
Category / Group / Component
If it is 4 levels deep, it looks like this:
Category / Group / Component / Group / Component
If it is 6 levels deep, it looks like this:
Category / Group / Component / Group / Component / Group / Component
Categories, Groups, and Components each have their own set of attributes. To further complicate matters, a relationship exists between a Component and entity A, a Component and entity B, and a Component and entity C.
My original thought was to strive to keep the Components in one table, however, I have been unable to come up with a normalized solution that satisfies this goal.
Instead, I came up with a normalized solution where there is a separate table for Components at each of the three possible component levels. However, I am not really comfortable with this because it triples the number of tables capturing links between components and entities A, B, and C (9 total link tables rather than 3 if all components were in one table).
Here is what the design I came up with looks like:
TABLE: Group_1_Components
ATTRIBUTES: Row_ID, Category, Component
RELATES-TO: Group_1_Components_A_Links, Group_1_Components_B_Links, Group_1_Components_C_Links, Group_2_Components
TABLE: Group_2_Components
ATTRIBUTES: Row_ID, Group, Component, Group_1_Component_Row_ID
RELATES-TO: Group_2_Components_A_Links, Group_2_Components_B_Links, Group_2_Components_C_Links, Group_1_Components, Group_3_Components
TABLE: Group_3_Components
ATTRIBUTES: Row_ID, Group, Component, Group_2_Component_Row_ID
RELATES-TO: Group_3_Components_A_Links, Group_3_Components_B_Links, Group_3_Components_C_Links, Group_2_Components
Each of the 9 link tables contains two Row IDs to address a many-to-many relationship with either table A, B, or C.
Is this a reasonable design or am I overlooking a simpler, more typical solution? I looked at a few design techniques specific to capturing hierarchies in a relational database, notably the adjacency list, but I am not sure they fit here, nor do they appear to be normalized solutions.
It should be noted that the hierarchy will seldom be modified; it will frequently be read, where reads retrieve either all of the components or the components at a specific level for a selected group. The link tables to entities A, B, and C will be written to regularly.
Any and all suggestions are welcome. Thanks in advance for your help. Brian
I suggest that you de-normalize your data so that your hierarchy is based on component/group entities, so that you match "regular" hierarchies. In this case you can have the following tables:
a) Components
b) Groups
c) Component_Groups - with a unique key on component_id and group_id to ensure that you only have one combination for each component and group
In this case then your hierarchy will be: Category -> Component_Group -> Component_Group -> Component_Group
Another option for this kind of problem is using a self-referencing table. Just one table.
Single table with ID, PARENT_ID and a TYPE so you can distinguish CATEGORY, GROUP and COMPONENT.
All categories would have no PARENT_ID and then you could search for all child objects where the parent id is equal to the id of the category you want to dive deeper into.
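As a sketch of the single-table idea, assuming Python's sqlite3 module as the engine: ID, PARENT_ID and TYPE come from the answer, while the table name `node`, the `name` column, the helper `children`, and the sample rows are made up for illustration. The sample mirrors the 4-level branch Category / Group / Component / Group / Component from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),  -- NULL for categories
    type      TEXT NOT NULL CHECK (type IN ('CATEGORY', 'GROUP', 'COMPONENT')),
    name      TEXT NOT NULL
);
INSERT INTO node VALUES
    (1, NULL, 'CATEGORY',  'cat-1'),
    (2, 1,    'GROUP',     'group-a'),
    (3, 2,    'COMPONENT', 'comp-x'),
    (4, 3,    'GROUP',     'group-b'),
    (5, 4,    'COMPONENT', 'comp-y');
""")


def children(parent_id):
    """Direct children of one node, as (id, type, name) tuples."""
    return conn.execute(
        "SELECT id, type, name FROM node WHERE parent_id = ? ORDER BY id",
        (parent_id,),
    ).fetchall()


print(children(1))  # [(2, 'GROUP', 'group-a')]
print(children(3))  # [(4, 'GROUP', 'group-b')]
```

With all components in one table, the links to entities A, B, and C would need only one link table each, keyed on the node id.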
I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to be able to search and filter results based on any of those product categories. I also expect to have a large number of users. My initial thought was having one big table with every product in it, with a column for each piece of information and an index on anything I need to be able to search by, but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote a tree-like navigation of tables, but because you can search by anything, I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at all the data you have and determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of the data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables; others (price) would be simple scalars. With appropriate indexing you'll be fine. For advanced analytics, consider a dimensionally modeled star schema, but perhaps not for your live transaction system; it depends on what your data flow/workflow/transactions are. Or consider applying some of its principles in your transactional database. Ralph Kimball is a good source of information on dimensional modeling.
I don't see any need for the tree structure here. You can do this with a single table.
If you insist on a tree structure with a hierarchy, here is an example to get you started.
For text based search, and ease of startup & design, I strongly recommend Apache SOLR. The SOLR API is easy to use (especially JSON). Databases do text search poorly, and I would instead recommend that you just make sure that they respond to primary/unique key queries properly, and those are the fields you should index.
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index on the top 5 or 10 columns you think users are likely to search for, unless it's possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID | PARENT_ID | NAME
1  | NULL      | ROOT
2  | 1         | SOCKS
3  | 1         | HELICOPTER PARTS
4  | 2         | ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.
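To illustrate the single-query traversal, here is a sketch using the standard WITH RECURSIVE syntax through Python's sqlite3 module. This is a swap-in for the Oracle example the answer mentions (PostgreSQL also supports WITH RECURSIVE). The table and sample rows are the ones above; the CTE name `subtree` and the DEPTH column are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT_CATEGORY (
    ID        INTEGER PRIMARY KEY,
    PARENT_ID INTEGER REFERENCES PRODUCT_CATEGORY(ID),
    NAME      TEXT NOT NULL
);
INSERT INTO PRODUCT_CATEGORY VALUES
    (1, NULL, 'ROOT'),
    (2, 1,    'SOCKS'),
    (3, 1,    'HELICOPTER PARTS'),
    (4, 2,    'ARGYLE');
""")

# Start at one category, then repeatedly pull in rows whose parent is
# already in the result set.
query = """
WITH RECURSIVE subtree(ID, NAME, DEPTH) AS (
    SELECT ID, NAME, 0 FROM PRODUCT_CATEGORY WHERE ID = ?
    UNION ALL
    SELECT c.ID, c.NAME, s.DEPTH + 1
    FROM PRODUCT_CATEGORY c
    JOIN subtree s ON c.PARENT_ID = s.ID
)
SELECT NAME, DEPTH FROM subtree ORDER BY DEPTH, ID
"""

socks = conn.execute(query, (2,)).fetchall()
everything = conn.execute(query, (1,)).fetchall()
print(socks)       # [('SOCKS', 0), ('ARGYLE', 1)]
print(everything)
```

Passing the root's ID returns the whole tree with each row's depth, which is enough to rebuild the nesting on the client side.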