TLDR: Looking for a free database option to run locally that lends itself to composition. Object A is composed of B,C,D. Where B,C are of the same type as A. What should I use?
I would like to experiment with some open source database. I am playing a game that has crafting and it is a bit cumbersome to drill down into the objects to figure out what resources I need. This seems like a good problem to solve via a database. I was hoping to explore a NoSQL option as I do not have much experience with them.
To use a simple contrived example:
A staff: requires 5 wood
A spearhead: requires 2 iron
A spear: requires 1 staff, 1 spearhead
A trident: requires 1 staff, 3 spearheads, 2 iron
If I wanted to build 2 tridents and 1 spear, a query to my database would inform me that I need 15 wood and 18 iron.
So each craftable item would require some combination of base resources and/or other crafted items. The question to be put to the database is, given that I already possess some resources, what is remaining for me to collect to build some combination of items?
If I were to attempt this in SQL I would make 4 tables:
A resource table (the raw materials needed for crafting)
An item table (the things I can craft)
A many to many table, mapping items to items
A many to many table, mapping items to resources
What would you recommend I use? An answer might be, there are no NoSQL databases that lend themselves well to your problem set (model and queries).
Using the Bill of Materials picture I linked to in the comment, you have a Resource table.
Resource
--------
Resource ID
Resource Name
Here are some rows, based on your example. I deliberately added spearhead after spear. The order of the resources doesn't matter.
Resource ID | Resource Name
---------------------------
          1 | Wood
          2 | Iron
          3 | Staff
          4 | Spear
          5 | Spearhead
          6 | Trident
Next, you have a ResourceHierarchy table.
ResourceHierarchy
-----------------
ResourceHierarchy ID
Resource ID
Parent Resource ID (FK)
Resource Quantity
Here are some rows, again based on your example.
ResourceHierarchy ID | Resource ID | Parent Resource ID | Resource Quantity
---------------------------------------------------------------------------
                   1 |           6 |               null |              null
                   2 |           5 |                  6 |                 3
                   3 |           3 |                  6 |                 1
                   4 |           2 |                  6 |                 2
                   5 |           3 |                  4 |                 1
                   6 |           5 |                  4 |                 1
                   7 |           2 |                  5 |                 2
                   8 |           1 |                  3 |                 5
Admittedly, this is difficult to create by hand, and it is easy to make mistakes. You would have a part of your application that allows you to create Resource and ResourceHierarchy rows using the actual resource names.
You have to make several queries to retrieve all of the components for a top-level resource, starting with a null Parent Resource ID and querying your way down through the resources. That's the disadvantage of a Bill of Materials.
The advantage of a Bill of Materials is there's no limit to the nesting and you can freely combine items and resources to make more items.
You can identify resources and items with a flag in your Resource table if you wish.
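On engines that support recursive common table expressions (SQLite, MySQL 8+, PostgreSQL, SQL Server), the drill-down can be collapsed into a single query. Here's a minimal sketch using SQLite from Python; table and column names follow the answer, with the surrogate key and the null top-level row omitted for brevity, and the order (2 tridents, 1 spear) taken from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Resource (ResourceID INTEGER PRIMARY KEY, ResourceName TEXT);
CREATE TABLE ResourceHierarchy (
    ResourceID INTEGER,          -- the component
    ParentResourceID INTEGER,    -- the composite it goes into
    ResourceQuantity INTEGER
);
INSERT INTO Resource VALUES (1,'Wood'),(2,'Iron'),(3,'Staff'),
                            (4,'Spear'),(5,'Spearhead'),(6,'Trident');
INSERT INTO ResourceHierarchy VALUES
    (5,6,3),(3,6,1),(2,6,2),   -- trident: 3 spearheads, 1 staff, 2 iron
    (3,4,1),(5,4,1),           -- spear: 1 staff, 1 spearhead
    (2,5,2),                   -- spearhead: 2 iron
    (1,3,5);                   -- staff: 5 wood

CREATE TABLE CraftOrder (ResourceID INTEGER, Qty INTEGER);
INSERT INTO CraftOrder VALUES (6,2),(4,1);  -- 2 tridents, 1 spear
""")

# Walk the hierarchy down from the ordered items, multiplying quantities,
# then sum only the leaves (resources that are never themselves a parent).
rows = con.execute("""
WITH RECURSIVE need(ResourceID, Qty) AS (
    SELECT ResourceID, Qty FROM CraftOrder
    UNION ALL
    SELECT rh.ResourceID, need.Qty * rh.ResourceQuantity
    FROM ResourceHierarchy rh
    JOIN need ON rh.ParentResourceID = need.ResourceID
)
SELECT r.ResourceName, SUM(need.Qty)
FROM need
JOIN Resource r ON r.ResourceID = need.ResourceID
WHERE need.ResourceID NOT IN (SELECT ParentResourceID FROM ResourceHierarchy)
GROUP BY r.ResourceName
""").fetchall()
print(sorted(rows))  # → [('Iron', 18), ('Wood', 15)]
```

This reproduces the question's expected totals (15 wood, 18 iron for 2 tridents and 1 spear) in one round trip instead of one query per level.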
You might want to consider a graph data model, such as JanusGraph, where entities (nodes) could be members of a set (defined as another node) via a relationship (edge).
That would allow you to have the multi-child or multi-parent relationships you are talking about.
Mother == married to == Father
child1, child2, child3 ... childN
would all then have a "childOf" relationship to both the mother and, separately, to the father, and would be "siblingOf" the other members of the set, as labeled along their edges.
Make sense?
Here are more of the types of edge labels and multiplicities you can have:
https://docs.janusgraph.org/basics/schema/
Disclosure: I work for ScyllaDB, and our database is often used as a storage engine under JanusGraph implementations. There are many other types of NoSQL graph databases you can check out. Find the one that's right for your use case & data model.
Edit: JanusGraph is open source, as is Scylla:
https://github.com/JanusGraph
https://github.com/scylladb/scylla
We are in the process of implementing a new data store using Data Vault methodology in the Snowflake database. Our intention is to hold to the latest standards and best practices as far as we can e.g. an insert only approach and attempting to avoid various anti-patterns such as driving key relationships wherever practicable (see comments here for discussion on driving keys).
The following is a simplified example of a section of our data relating to ratings assigned to properties over time (think hotel star ratings or similar).
Central to this is a table connecting a property to a rating. The following example shows the rating history for a single property against two different schemes.
PropertyRatingID | PropertyID | RatingSchemeID | RatingID | EffectiveDate
-------------------------------------------------------------------------
               1 |          1 |              1 |        1 | 2020-01-01
               2 |          1 |              1 |        2 | 2020-01-02
               3 |          1 |              1 |        1 | 2020-01-03
               4 |          1 |              2 |        3 | 2020-01-02
               5 |          1 |              2 |        4 | 2020-01-03
Relevant information regarding the data structure.
PropertyRatingID is an identity key to ensure uniqueness and has no business meaning.
At any given time a property can only have a single rating for a single scheme, but it can be rated under several schemes.
PropertyID, RatingSchemeID and EffectiveDate are not required to be a unique combination.
EffectiveDate does not represent the date the record was entered in to the system and can be backdated into the past by a significant duration creating a bitemporal situation between effective date and the date the change is applied.
PropertyID, RatingSchemeID and RatingID are all foreign keys to other tables providing descriptive data. Property already exists as a hub in its own right within our model.
A timeline of the above can be pictured as follows.
Date       | Scheme1Rating | Scheme2Rating
------------------------------------------
2020-01-01 |             1 |          NULL
2020-01-02 |             2 |             3
2020-01-03 |             1 |             4
2020-01-04 |             1 |             4
My initial attempt to model this was to build a hub for RatingID, a link between property and rating, and a satellite attached to the link using PropertyRatingID holding all other information (primarily RatingSchemeID and EffectiveDate) to make it multi-active. This proved very difficult to use because the driving key behind changes (PropertyID and RatingSchemeID) was spread between the link and satellite.
In terms of the bitemporality of the situation, the focus will be on obtaining the most recently recorded set of effective dates (i.e. latest system date) to create a history of ratings over time, e.g. EffectiveEndDate becomes the equivalent of LEAD(EffectiveDate) OVER (PARTITION BY PropertyID, RatingSchemeID ORDER BY EffectiveDate ASC) on the raw table. While we will not make regular use of records of values from past system times, we will on occasion look at these on an ad-hoc basis to explain changes to the history between reporting periods.
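As a small sketch of that EffectiveEndDate derivation (using SQLite here purely for illustration; the real system is Snowflake), with LEAD partitioned by PropertyID and RatingSchemeID so that a rating's end date comes from the next rating in the same scheme:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE PropertyRating (
    PropertyRatingID INTEGER PRIMARY KEY,
    PropertyID INTEGER, RatingSchemeID INTEGER,
    RatingID INTEGER, EffectiveDate TEXT)""")
con.executemany("INSERT INTO PropertyRating VALUES (?,?,?,?,?)", [
    (1, 1, 1, 1, '2020-01-01'),
    (2, 1, 1, 2, '2020-01-02'),
    (3, 1, 1, 1, '2020-01-03'),
    (4, 1, 2, 3, '2020-01-02'),
    (5, 1, 2, 4, '2020-01-03'),
])

# Each rating row is closed out by the next EffectiveDate in the same
# scheme for the same property; the current rating has a NULL end date.
rows = con.execute("""
SELECT RatingSchemeID, RatingID, EffectiveDate,
       LEAD(EffectiveDate) OVER (
           PARTITION BY PropertyID, RatingSchemeID
           ORDER BY EffectiveDate) AS EffectiveEndDate
FROM PropertyRating
ORDER BY RatingSchemeID, EffectiveDate
""").fetchall()
for row in rows:
    print(row)
```

Running this against the sample rows reproduces the timeline above: rating 1 under scheme 1 runs 2020-01-01 to 2020-01-02, and the latest rating in each scheme has a NULL end date.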
A simple solution would be to join across multiple tables within the source system to flatten the data, separate it, and produce a satellite per rating scheme attached directly to the property hub. This would give us a short-term solution but would be inflexible (requiring a new hub for any new rating scheme) and still requires these satellites to be multi-active to hold the multiple effective dates current in the source system.
I believe the ideal solution requires at least one more hub relating to the driving key and potentially a second relating to an assignment of a rating to a property. Much of my reading (see previous link and this article) implies my satellite should be attached to a hub rather than a link.
What would an effective approach be to model this using the Data Vault methodology?
I would be interested in a discussion of the trade-offs of any proposed solution, for example additional "weak" hubs vs. resolving driving-key issues within more complex queries.
This scenario, as I understand it, is more like a LINK/SAT with PropertyID, RatingSchemeID as the LINK natural key (linking to the Property and RatingScheme HUBs) and RatingID in the SAT (hanging off the LINK).
The setup.
I have a table that stores a list of physical items for a game. Items also have a hierarchical list of categories. Example base table:
Items
id | parent_id | is_category | name | description
-- | --------- | ----------- | ------- | -----------
1 | 0 | 1 | Weapon | Something intended to cause damage
2 | 1 | 1 | Ranged | Attack from a distance
3 | 1 | 1 | Melee | Must be able to reach the target with arm
4 | 2 | 0 | Musket | Shoots hot lead.
5 | 2 | 0 | Bomb | Fire damage over area
6 | 0 | 1 | Mount | Something that carries a load.
7 | 6 | 0 | Horse | It has no name.
8 | 6 | 0 | Donkey | Don't assume or you become one.
The system is currently running on PHP and SQLite, but the database back-end is flexible and may use MySQL, and the front-end may eventually use JavaScript or Objective-C/Swift.
The problem.
In the sample above the program must have a different special handling for each of the top level categories and the items underneath them. e.g. Weapon and Mount are sold by different merchants, weapons may be carried while a mount cannot.
What is the best way to flag the top level tiers in code for special handling?
While the top level categories are relatively fixed I would like to keep them in the DB so it is easier to generate the full hierarchy for visualization using a single (recursive) function.
Nearly all foreign keys that identify an item may also identify an item category so separating them into different tables seemed very clunky.
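For reference, the single recursive traversal for visualization can also live in the database itself if your SQLite/MySQL version supports recursive CTEs. A sketch against the sample Items rows (the zero-padded path column is one common trick to get depth-first ordering, not part of the question's schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Items (id INTEGER PRIMARY KEY, parent_id INTEGER,
                    is_category INTEGER, name TEXT, description TEXT);
INSERT INTO Items VALUES
 (1,0,1,'Weapon','Something intended to cause damage'),
 (2,1,1,'Ranged','Attack from a distance'),
 (3,1,1,'Melee','Must be able to reach the target with arm'),
 (4,2,0,'Musket','Shoots hot lead.'),
 (5,2,0,'Bomb','Fire damage over area'),
 (6,0,1,'Mount','Something that carries a load.'),
 (7,6,0,'Horse','It has no name.'),
 (8,6,0,'Donkey','Don''t assume or you become one.');
""")

# Build the whole tree in one query; sorting by the accumulated id path
# yields depth-first order, so indenting by depth draws the hierarchy.
tree = con.execute("""
WITH RECURSIVE tree(id, name, depth, path) AS (
    SELECT id, name, 0, printf('%03d', id) FROM Items WHERE parent_id = 0
    UNION ALL
    SELECT i.id, i.name, t.depth + 1, t.path || '.' || printf('%03d', i.id)
    FROM Items i JOIN tree t ON i.parent_id = t.id
)
SELECT depth, name FROM tree ORDER BY path
""").fetchall()
for depth, name in tree:
    print('  ' * depth + name)
```

This keeps the top-level categories in the table (as the question prefers) while still rendering the full hierarchy in a single query.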
My thoughts.
I can use a string match on the name and store the id in an internal constant upon first execution. An ugly solution at best that I would like to avoid.
I can store the id in an internal constant at install time. Better, but still not quite what I prefer.
I can store an array in code of the top-level elements instead of putting them in the table. This creates a lot of complications, such as how a child points to its top-level parent. Another id would have to be added to the table, used by maybe 100 of the 10K rows.
I can store an array in code and enable identity insert at install time to add the top-level elements, sharing the ids of the static array. Probably my best idea, but I don't really like identity insert; it just doesn't feel "database" to me. Also, what if a new top-level item appears? Maybe start the ids at 1 million for these categories?
I can add a flag column ("varchar(1) top_category" or "int top_category") with a character or bit-map indicating the value. Again, a column used on maybe 10 of 10K rows.
As a software person I tend to find software solutions, so I'm curious if there is a more DB-type solution out there.
Original table, with a join to actions.
Yes, you can put everything in a single table. You'd just need to establish unique rows for every scenario. This sqlfiddle gives you an example... but IMO it starts to become difficult to make sense of. It doesn't take care of all scenarios, due to not being able to do full joins (just a limitation of sqlfiddle, which is awesome otherwise).
IMO, breaking things out into tables makes more sense. Here's another example of how I'd start to approach a schema design for some of the scenarios you described.
The base tables themselves look clunky, but they give so much more flexibility in how the data is used.
tl;dr analogy ahead
A database isn't a list of outfits, organized in rows. It's where you store the clothes that make up an outfit.
So the clunky feel of breaking things out into separate tables is actually the benefit of relational databases. Putting everything into a single table feels efficient and optimized at first... but as you expand complexity, it starts to become a pain.
Think of your schema as a dresser. Drawers are your tables. If you only have a few socks and a little underwear, putting them all in one drawer is efficient. But once you get enough socks, it can become a pain to have them all in the same drawer as your underwear. You have dress socks, crew socks, ankle socks, furry socks. So you put them in another drawer. Once you have shirts, shorts, pants, you start putting them in drawers too.
The urge to put all data into a single table is often driven by how you intend to use the data.
Assuming your dresser is fully stocked and neatly organized, you have several potential unique outfits, all neatly organized in your dresser. You just need to put them together. Selects and joins are how you would assemble those outfits. The fact that your favorite jeans/t-shirt/sock combo isn't all in one drawer doesn't make it clunky or inefficient. The fact that they are separated and organized allows you to:
1. Quickly know where to get each item
2. See potential other new favorite combos
3. Quickly see what you have of each component of your outfit
There's nothing wrong with choosing to think of the outfit first, and how you will put it away later. If you only have one outfit, putting everything in one drawer is way easier than putting each piece in a separate drawer. However, as you expand your wardrobe, the single drawer for everything starts to become inefficient.
You typically want to plan for expansion and versatility. Your program can put the data together however you need it; a well-organized schema can do that for you. Whether you use an ORM and do model-driven data storage, or start with the schema and then build models based on it, the more complex your data requirements become, the more similar both approaches become.
A relational database is meant to store entities in tables that relate to each other. Very often you'll see examples of a company database consisting of departments, employees, jobs, etc. or of stores holding products, clients, orders, and suppliers.
It is very easy to query such database and for example get all employees that have a certain job in a particular department:
select *
from employees
where job_id = (select id from job where name = 'accountant')
and dept_id = (select id from departments where name = 'buying');
You on the other hand have only one table containing "things". One row can relate to another meaning "is of type". You could call this table "something". And were it about company data, we would get the job thus:
select *
from something
where description = 'accountant'
and parent_id = (select id from something where description = 'job');
and the department thus:
select *
from something
where description = 'buying'
and parent_id = (select id from something where description = 'department');
These two would still have to be related by persons working in a department in a job. A mere "is type of" doesn't suffice then. The short query I've shown above would become quite big and complex with your type of database. Imagine the same with a more complicated query.
And your app would either not know anything about what it's selecting (well, it would know it's something which is of some type and another something that is of some type and the person (if you go so far as to introduce a person table) is connected somehow with these two things), or it would have to know what description "department" means and what description "job" means.
Your database is blind. It doesn't know what a "something" is. If you make a programming mistake some time (most of us do), you may even store wrong relations (A Donkey is of type Musket and hence "shoots hot lead" while you can ride it) and your app may crash at one point or another not able to deal with a query result.
Don't you want your app to know what a weapon is and what a mount is? That a weapon enables you to fight and a mount enables you to travel? So why make this a secret? Do you think you gain flexibility? Well, then add food to your table without altering the app. What will the app do with this information? You see, you must code this anyway.
Separate entity from data. Your entities are weapons and mounts so far. These should be tables. Then you have instances (rows) of these entities that have certain attributes. A bomb is a weapon with a certain range for instance.
Tables could look like this:
person (person_id, name, strength_points, ...)
weapon (weapon_id, name, range_from, range_to, weight, force_points, ...)
person_weapon(person_id, weapon_id)
mount (mount_id, name, speed, endurance, ...)
person_mount(person_id, mount_id)
food (food_id, name, weight, energy_points, ...)
person_food (person_id, food_id)
armor (armor_id, name, protection_points, ...)
person_armor <= a table for m:n or a mere person.id_armor for 1:n
...
This is just an example, but it shows clearly what entities your app is dealing with. It knows weapons and food are something the person carries, so these can only have a maximum total weight for a person. A mount is something to use for transport and can make a person move faster (or carry weight, if your app and tables allow for that). Etc.
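As a quick sketch of the kind of query this schema buys you (table and column names here are illustrative, trimmed down from the listing above): total carried weight is a pair of plain joins, because the schema itself encodes that weapons and food are carried things.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE weapon (weapon_id INTEGER PRIMARY KEY, name TEXT, weight INTEGER);
CREATE TABLE person_weapon (person_id INTEGER, weapon_id INTEGER);
CREATE TABLE food (food_id INTEGER PRIMARY KEY, name TEXT, weight INTEGER);
CREATE TABLE person_food (person_id INTEGER, food_id INTEGER);
INSERT INTO person VALUES (1, 'Alice');
INSERT INTO weapon VALUES (1, 'Musket', 6), (2, 'Bomb', 3);
INSERT INTO person_weapon VALUES (1, 1), (1, 2);
INSERT INTO food VALUES (1, 'Bread', 1);
INSERT INTO person_food VALUES (1, 1);
""")

# Everything one person carries, weapons and food alike, in one query.
carried_weight = con.execute("""
SELECT SUM(weight) FROM (
    SELECT w.weight FROM person_weapon pw
    JOIN weapon w ON w.weapon_id = pw.weapon_id WHERE pw.person_id = ?
    UNION ALL
    SELECT f.weight FROM person_food pf
    JOIN food f ON f.food_id = pf.food_id WHERE pf.person_id = ?
)
""", (1, 1)).fetchone()[0]
print(carried_weight)  # → 10
```

Compare this with the single "something" table, where the same question would need the app to know which descriptions mean "carried".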
My question title might be a little bit misleading, since I don't know how to word it. Sorry about that.
I have a table called course which holds a list of courses with ID and Name columns
ID Name
---- ---------
1 JAVA
2 C#
3 C++
4 HTML
5 PHP
6 JAVASCRIPT
7 HARDWARE
8 PERL
9 CSS
There is a simple app: a student asks if he can enroll in a particular course, and the system will check whether he has finished the prerequisites. To do a particular course, you need to finish one or more prerequisites. Here are some silly examples:
To do JAVA, you have to finish HARDWARE and HTML
To do C++, you have to finish HARDWARE and PHP
To do CSS, you have to finish JAVA
How can I show this relationship in the database? Do I need to add a new column to achieve this?
Thanks a lot for your help.
You need to introduce a second table called CoursePreRequisite. It could have the following columns:
Id
CourseId
PreReqCourseId
sample entries
Id CourseId PreReqCourseId
--- --------- --------------
1 1 4
2 1 7
The CourseId / PreReqCourseId combination has to be defined as unique in the table. You could of course do away with the Id column in the second table, but I personally like to use an Id column in all my tables; it makes updating them easier.
You need to define the dependency relationship in your table design. Per your description you have the following scenarios:
One Course can depend on many
One Course can be a dependency for many
That is a many-to-many relationship, which is best expressed as an extra table containing a foreign key to each side of the relationship (in your case both sides are courses, from the same table). The new table design should be as Raj highlighted in his answer:
Id CourseId PreReqCourseId
--- --------- --------------
1 1 4
2 1 7
3 2 8
4 2 3
Later, to know which courses a given course depends on, all you need is to run:
SELECT PreReqCourseId FROM ThisNewTable WHERE CourseId = #Value
Remember to replace #Value with the Id of the course you are looking for, or think about using it as a parameter if your script is called from an application (C# or otherwise).
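Putting the two answers together, the enrollment check itself is then a one-table lookup against the set of courses the student has finished. A minimal sketch (the `missing_prereqs` helper and sample data are illustrative; the table is spelled CoursePreRequisite here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Course (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE CoursePreRequisite (
    Id INTEGER PRIMARY KEY,
    CourseId INTEGER,
    PreReqCourseId INTEGER,
    UNIQUE (CourseId, PreReqCourseId));
INSERT INTO Course VALUES (1,'JAVA'),(4,'HTML'),(7,'HARDWARE'),(9,'CSS');
INSERT INTO CoursePreRequisite (CourseId, PreReqCourseId)
VALUES (1,4),(1,7),(9,1);   -- JAVA needs HTML and HARDWARE; CSS needs JAVA
""")

def missing_prereqs(course_id, finished):
    """Return prerequisite course ids the student has not finished yet."""
    rows = con.execute(
        "SELECT PreReqCourseId FROM CoursePreRequisite WHERE CourseId = ?",
        (course_id,)).fetchall()
    return [r[0] for r in rows if r[0] not in finished]

print(missing_prereqs(1, {4}))     # → [7]  (HTML done, HARDWARE missing)
print(missing_prereqs(1, {4, 7}))  # → []   (may enroll in JAVA)
```

An empty result means the student may enroll.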
Furthering: Database design for dynamic form field validation
How would I model the database when a particular field can have 0 or more validations rules AND that each validation rule is "related" to another rule via AND or OR.
For example, say I have field1 that needs to be minimum of 5 characters AND maximum 10 characters. These are 2 rules that apply to the same field and are related via an "AND." An example of how rules relate via an "OR" would be something like this: field1 should have exactly 5 characters OR exactly 10 characters.
The validation could get complex and have n-levels of nesting. How do I do this in a DB?
I don't think there's a simple answer to how to model this. The following conversation will hopefully get you started, and give you some sense of the issues involved.
So far as I can see, you have at least three types of entity: fields, simple rules, and complex rules (that is, rules made by combining other simple and/or complex rules).
The one piece of good news is that I'm pretty sure you just need two types of complex rule: an AND rule and an OR rule, each of which applies a set of subrules and returns true or false based on the results returned by those subrules.
So you want to build a structure where each form has 1 or more fields, each field has 0 or more validation rules, and each rule has 0 or more sub-rules.
One challenge is just to keep track of the structure of each complex rule. What strikes me as the simplest way to do this is in a tree structure where each node has a parent. So you might have an OR rule with a parent of 0 (indicating that it's a top-level rule). There would then be 2 or more rules with the OR's ruleId as their parent. In turn, any of those might be an AND or OR rule which would be the parent of other rules. And so on down the tree.
Another challenge is how to extract your structure from the db so you can validate a form. It's preferable to minimize the number of queries it takes to do this. In a straight tree, where the structure is only established by children nodes knowing their parents, you'd need a separate query to get each parent's immediate children. So it'd be nice if you could aggregate all the children together under a single ancestor.
If any rule can only be assigned to 1 field, then you can have a fieldId column in your rules table, and each rule will be assigned to a field. Then you can join a form to its fields, and those fields to their rules, and pull out everything in one query. Then the application logic would be responsible for turning the data into a functional tree structure.
However, if you want rules to be reusable, that's not going to work. For example, you might want an abstract zip code rule which combined several sub rules (rather than being a giant regex). And then you might want to make that a US zip code rule, and make another for Canada, and another for any of multiple countries, and then you might want to combine some or all of those depending on which field was being validated. So you might have, for example a US OR Canada zip rule applied to some fields, a US only rule applied to other fields, etc.
One way to do this is to remove the fieldId field from rules, and add a new field_rules junction table with fieldId and ruleId as its columns. However, removing fieldId from rules puts you back into not having a single-query means of extracting all the rules (including subrules) for a field, never mind for a form. So you might add an origin column to the rules table, and all the subrules of a complex rule would have their top-level rule's id as their origin.
Now things might get even more complex if you want to allow overriding some of a reusable rule's data for specific fields. Then you might add either a new field_rule_data table, or just data columns to the field_rules table.
Implementing a tree structure means that your application logic for both building and applying complex rules is probably going to have to be recursive.
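To make that recursion concrete, here is a minimal sketch of evaluating such a tree. The `kind` column and the regex leaves are my own illustration, not part of the schema above: each row knows its parent, the application groups rows by parent, and AND/OR nodes recurse into their children.

```python
import re

# (rule_id, parent, kind, data); parent 0 marks the top-level rule.
rows = [
    (1, 0, 'OR',    None),              # top-level: either pattern passes
    (2, 1, 'regex', r'^\d{5}$'),        # 5 digits
    (3, 1, 'regex', r'^\d{5}-\d{4}$'),  # 5-4 pattern
]

# Group rules by their parent so each node can find its children.
children = {}
for rid, parent, kind, data in rows:
    children.setdefault(parent, []).append((rid, kind, data))

def evaluate(rule, value):
    """Recursively evaluate a rule node against a field value."""
    rid, kind, data = rule
    subs = children.get(rid, [])
    if kind == 'AND':
        return all(evaluate(s, value) for s in subs)
    if kind == 'OR':
        return any(evaluate(s, value) for s in subs)
    return re.match(data, value) is not None  # leaf rule

top = children[0][0]  # the rule whose parent is 0
print(evaluate(top, '12345'))       # → True
print(evaluate(top, '12345-6789'))  # → True
print(evaluate(top, 'ABCDE'))       # → False
```

The same evaluator handles arbitrary nesting, since an AND or OR child can itself be another AND or OR node.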
Having said all that, I suspect your real challenge is going to be at the UI level.
Edit
I thought about this some more, and it's seeming even more complicated. I'm sure the following is also inadequate, but I hope it will facilitate figuring out a full answer.
I'm now thinking you have 5 tables: rules, rule_defs, rule_defs_index, fields, field_rules. They go something like this:
Rules
rule_id (PK)
name
data (can be null)
Rule_Defs
rule_def_id (PK)
rule_id (FK to rule_id)
parent (FK to rule_def_id)
origin (FK to rule_def_id: optional convenience field)
Rule_Defs_Index
rule_id (FK)
rule_def_id (FK)
Fields
field_id (PK)
name
Field_Rules
field_id (FK and part of PK)
rule_id (FK and part of PK)
Just making stuff up here in a vaguely plausible way, here's some sample data:
Rules
id name data
1 AND
2 OR
3 5 digits /^\d{5,5}$/
4 5-4 pattern /^\d{5,5}-\d{4,4}$/
5 US Zip
6 6 alphanumerics /^[A-Za-z0-9]{6,6}$/
7 US or Canada Zip
Rule_Defs
id rule_id parent origin
1 5 0 1
2 2 1 1
3 3 2 1
4 4 2 1
5 7 0 5
6 2 5 5
7 5 6 5
8 6 6 5
Rule_Defs_Index (just data for US Canada Zip since that's biggest)
rule_id rule_def_id
7 2
7 3
7 4
7 5
7 6
7 7
Fields
field_id name
1 billing zip
2 shipping zip
Field_Rules
field_id rule_id
1 7
2 7
Note that the assumption here is that creating and editing rules will happen rarely relative to applying them. Thus creating and editing will be fairly cumbersome and relatively slow activities. To keep that from being the case for the far more common application of rules, the Rule_Defs_Index should make it possible to extract everything needed to build a rule structure for a field (or a form) with a single query. Of course, once it's retrieved, the application will have to do a fair amount of work to turn the data into a useful structure.
Note that you might want to cache the constructed data in serialized form, rebuilding the cache in the relatively rare instances when a rule is edited or created.
I'm working on a design for a hierarchical database structure which models a catalogue containing products (this is similar to this question). The database platform is SQL Server 2005 and the catalogue is quite large (750,000 products, 8,500 catalogue sections over 4 levels) but is relatively static (reloaded once a day) and so we are only concerned about READ performance.
The general structure of the catalogue hierarchy is:-
Level 1 Section
Level 2 Section
Level 3 Section
Level 4 Section (products are linked to here)
We are using the Nested Sets pattern for storing the hierarchy levels and storing the products which exist at that level in a separate linked table. So the simplified database structure would be
CREATE TABLE CatalogueSection
(
SectionID INTEGER,
ParentID INTEGER,
LeftExtent INTEGER,
RightExtent INTEGER
)
CREATE TABLE CatalogueProduct
(
ProductID INTEGER,
SectionID INTEGER
)
We do have an added complication in that we have about 1000 separate customer groups which may or may not see all products in the catalogue. Because of this we need to maintain a separate "copy" of the catalogue hierarchy for each customer group so that when they browse the catalogue, they only see their products and they also don't see any sections which are empty.
To facilitate this we maintain a table of the number of products at each level of the hierarchy "rolled up" from the section below. So, even though products are only directly linked to the lowest level of the hierarchy, they are counted all the way up the tree. The structure of this table is
CREATE TABLE CatalogueSectionCount
(
SectionID INTEGER,
CustomerGroupID INTEGER,
SubSectionCount INTEGER,
ProductCount INTEGER
)
So, onto the problem
Performance is very poor at the top levels of the hierarchy. The general query to show the "top 10" products in the selected catalogue section (and all child sections) is taking somewhere in the region of 1 minute to complete. At lower sections in the hierarchy it is faster but still not good enough.
I've put indexes (including covering indexes where applicable) on all key tables, run it through the query analyzer, index tuning wizard etc but still cannot get it to perform fast enough.
I'm wondering whether the design is fundamentally flawed or whether it's just because we have such a large dataset. We have a reasonable development server (3.8GHz Xeon, 4GB RAM) but it's just not working :)
Thanks for any help
James
Use a closure table. If your basic structure is a parent-child with the fields ID and ParentID, then the structure for a closure table is ID and DescendantID. In other words, a closure table is an ancestor-descendant table, where each possible ancestor is associated with all descendants. You may include a LevelsBetween field if you need. Closure table implementations usually include self-referencing records, i.e. ID 1 is an ancestor of descendant ID 1 with LevelsBetween of zero.
Example:
Parent/Child
ParentID - ID
1 - 2
1 - 3
3 - 4
3 - 5
4 - 6
Ancestor/Descendant
ID - DescendantID - LevelsBetween
1 - 1 - 0
1 - 2 - 1
1 - 3 - 1
1 - 4 - 2
1 - 5 - 2
1 - 6 - 3
2 - 2 - 0
3 - 3 - 0
3 - 4 - 1
3 - 5 - 1
3 - 6 - 2
4 - 4 - 0
4 - 6 - 1
5 - 5 - 0
6 - 6 - 0
The table is intended to eliminate recursive joins. You push the load of the recursive join into an ETL cycle that you do when you load the data once a day. That shifts it away from the query.
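A sketch of that ETL step in plain Python, using the parent/child rows above: walk down from every node and record each (ancestor, descendant, depth) triple, including the self-referencing depth-0 rows.

```python
# Parent/child edges from the example above: (ParentID, ID).
edges = [(1, 2), (1, 3), (3, 4), (3, 5), (4, 6)]

nodes = {n for edge in edges for n in edge}
children = {}
for parent, child in edges:
    children.setdefault(parent, []).append(child)

closure = []  # rows of (ID, DescendantID, LevelsBetween)

def walk(ancestor, node, depth):
    """Record node as a descendant of ancestor, then recurse downward."""
    closure.append((ancestor, node, depth))
    for child in children.get(node, []):
        walk(ancestor, child, depth + 1)

for n in sorted(nodes):
    walk(n, n, 0)  # self-referencing row with LevelsBetween of zero

for row in sorted(closure):
    print(row)
```

The output matches the Ancestor/Descendant table above; you would bulk-insert these rows once per daily load, so queries never recurse.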
Also, it allows variable-level hierarchies. You won't be stuck at 4.
Finally, it allows you to slot products in non-leaf nodes. A lot of catalogs create "Miscellaneous" buckets at higher levels of the hierarchy to create a leaf-node to attach products to. You don't need to do that since intermediate nodes are included in the closure.
As far as indexing goes, I would do a clustered index on ID/DescendantID.
Now for your query performance. This takes a chunk out, but not all. You mentioned a "Top 10". This implies ranking over a set of facts that you haven't mentioned; we need details to help tune those. Plus, this only gets the leaf-level sections, not the products. At the very least, you should have an index on your CatalogueProduct that orders by SectionID/ProductID. I would force Section-to-Product joins to be loop joins based on the cardinality you provided. A report on a catalog section would go to the closure table to get descendants (using a clustered index seek). That list of descendants would then be used to get products from CatalogueProduct using looped index seeks on that index. Then, with those products, you would get the facts necessary to do the ranking.
You might be able to solve the customer groups problem with roles and tree ids, but you'll have to provide us with the query.
Might it be possible to calculate the ProductCount and SubSectionCount after the load each day?
If the data is changing only once a day, surely it's worthwhile to calculate these figures then, even if some denormalization is required.