Hierarchical Data Structure Design (Nested Sets) - sql-server

I'm working on a design for a hierarchical database structure which models a catalogue containing products (this is similar to this question). The database platform is SQL Server 2005 and the catalogue is quite large (750,000 products, 8,500 catalogue sections over 4 levels) but is relatively static (reloaded once a day) and so we are only concerned about READ performance.
The general structure of the catalogue hierarchy is:-
Level 1 Section
Level 2 Section
Level 3 Section
Level 4 Section (products are linked to here)
We are using the Nested Sets pattern for storing the hierarchy levels and storing the products which exist at that level in a separate linked table. So the simplified database structure would be
CREATE TABLE CatalogueSection
(
SectionID INTEGER,
ParentID INTEGER,
LeftExtent INTEGER,
RightExtent INTEGER
)
CREATE TABLE CatalogueProduct
(
ProductID INTEGER,
SectionID INTEGER
)
We do have an added complication in that we have about 1000 separate customer groups which may or may not see all products in the catalogue. Because of this we need to maintain a separate "copy" of the catalogue hierarchy for each customer group so that when they browse the catalogue, they only see their products and they also don't see any sections which are empty.
To facilitate this we maintain a table of the number of products at each level of the hierarchy "rolled up" from the section below. So, even though products are only directly linked to the lowest level of the hierarchy, they are counted all the way up the tree. The structure of this table is
CREATE TABLE CatalogueSectionCount
(
SectionID INTEGER,
CustomerGroupID INTEGER,
SubSectionCount INTEGER,
ProductCount INTEGER
)
So, onto the problem
Performance is very poor at the top levels of the hierarchy. The general query to show the "top 10" products in the selected catalogue section (and all child sections) is taking somewhere in the region of 1 minute to complete. At lower sections in the hierarchy it is faster but still not good enough.
I've put indexes (including covering indexes where applicable) on all key tables, run it through the query analyzer, index tuning wizard etc but still cannot get it to perform fast enough.
I'm wondering whether the design is fundamentally flawed or whether it's because we have such a large dataset? We have a reasonable development server (3.8GHZ Xeon, 4GB RAM) but it's just not working :)
Thanks for any help
James

Use a closure table. If your basic structure is a parent-child with the fields ID and ParentID, then the structure for a closure table is ID and DescendantID. In other words, a closure table is an ancestor-descendant table, where each possible ancestor is associated with all descendants. You may include a LevelsBetween field if you need. Closure table implementations usually include self-referencing records, i.e. ID 1 is an ancestor of descendant ID 1 with LevelsBetween of zero.
Example:
Parent/Child
ParentID - ID
1 - 2
1 - 3
3 - 4
3 - 5
4 - 6
Ancestor/Descendant
ID - DescendantID - LevelsBetween
1 - 1 - 0
1 - 2 - 1
1 - 3 - 1
1 - 4 - 2
1 - 6 - 3
2 - 2 - 0
3 - 3 - 0
3 - 4 - 1
3 - 5 - 1
3 - 6 - 2
4 - 4 - 0
4 - 6 - 1
5 - 5 - 0
The table is intended to eliminate recursive joins. You push the load of the recursive join into an ETL cycle that you do when you load the data once a day. That shifts it away from the query.
Also, it allows variable-level hierarchies. You won't be stuck at 4.
Finally, it allows you to slot products in non-leaf nodes. A lot of catalogs create "Miscellaneous" buckets at higher levels of the hierarchy to create a leaf-node to attach products to. You don't need to do that since intermediate nodes are included in the closure.
As far as indexing goes, I would do a clustered index on ID/DescendantID.
Now for your query performance. This takes a chunk out but not all. You mentioned a "Top 10". This implies ranking over a set of facts that you haven't mentioned. We need details to help tune those. Plus, this gets only gets the leaf-level sections, not the products. At the very least, you should have an index on your CatalogueProduct that orders by SectionID/ProductID. I would force Section to Product joins to be loop joins based on the cardinality you provided. A report on a catalog section would go to the closure table to get descendants (using a clustered index seek). That list of descendants would then be used to get products from CatalogueProduct using the index by looped index seeks. Then, with those products, you would get the facts necessary to do the ranking.

you might be able to solve the customer groups problem with roles and treeId's but you'll have to provide us with the query.

Might it be possible to calculate the ProductCount and SubSectionCount after the load each day?
If the data is changing only once a day surely it's worthwhile to calculate these figures then, even if some denormalization is required.

Related

Unlimited levels of hierarchy in SQL table - PostgreSQL

I am looking for a way to store and handle unlimited level of hierarchy for various organisations/entities stored in my DB. For example, instead of having just one parent and one child organisation (e.g. 2 levels of hierarchy) and just one-to-many relationship as allowed by self-join (e.g. having another column called parent referring to the IDs of the same table), I want to be able to have as many levels of hierarchy as possible and as many connections as possible.
Supposing I have an organisation table such as the following:
ID
Name
Other Non-related data
1
Test1
NULL
2
Test2
NULL
3
Test3
something
4
Test4
something else
5
Test5
etc
I am considering the following solution; for each table that I need this I can add another table named originalTable_hierarchy which refers to the organisation table in both columns and make it look like this:
ID
Parent ID
ChildID
1
1
2
2
2
4
3
3
1
4
3
2
5
2
3
From this table I can tell that 1 is parent to 2, 2 is parent to 4, 3 is parent to 1, 3 is also parent to 2, 2 is also parent to 3.
The restrictions I can think of are not to have the same ParentID and ChildID (e.g. a tuple like (3,3)) and not to have a record that puts them into the opposite order (e.g. if I have the (2,3) tuple, I can't also have (3,2))
Is this the correct solution for multiple organisations and suborganisations I might have later on? Users will have to navigate through them easily back and forth. If users decide to split one organisation into many, does this solution suffice? What else should I consider (extra or missing perks) when doing this instead of a traditional self-join or a certain number of tables for certain levels of hierarchy (e.g. organisaion table and suborganisation table)? Also, can you impose restrictions on certain records, so that no more childs of a certain parent can be created? Or to report on all the childs of an original parent?
Please feel free to also instruct on where to read more about this. Any relevant resources are welcome.
You only need a single table as having just one parent and one child allows an unlimited (theoretical anyway) levels in the hierarchy. You do this by reversing the relationship so that the Child references the Parent. (Your table has the Parent referencing the Child). This results in allowing a child, at any level, also being a parent. This can be chained as far as needed.
create table organization ( id integer primary key
, name text
, parent_id integer references organization(id)
, constraint parent_not_self check (parent_id <> id)
) ;
create unique index organization_not__mirrored
on organization( least(id,parent_id), greatest(id,parent_id) );
The check constraint enforces you first restriction and the unique index the second.
The following query shows the full hierarchy, along with the full path and the level.
with recursive hier(parent_id, child_id, path, level) as
( select id, parent_id, id::text, 1
from organization
where parent_id is null
union all
select o.id, o.parent_id,h.path || '->' ||o.id::text,h.level+1
from organization o
join hier h
on (o.parent_id = h.parent_id)
)
select * from hier;
See demo here.

What database lends itself to a composition relationship between entites?

TLDR: Looking for a free database option to run locally that lends itself to composition. Object A is composed of B,C,D. Where B,C are of the same type as A. What should I use?
I would like to experiment with some open source database. I am playing a game that has crafting and it is a bit cumbersome to drill down into the objects to figure out what resources I need. This seems like a good problem to solve via a database. I was hoping to explore a NoSQL option as I do not have much experience with them.
To use a simple contrived example:
A staff: requires 5 wood
A spearhead: requires 2 iron
A spear: requires 1 staff, 1 spearhead
A trident: requires 1 staff, 3 spearheads, 2 iron
If I wanted to build 2 tridents and 1 spear a query to my database would inform me I need 15 wood, 18 iron.
So each craftable item would require some combination of base resources and/or other crafted items. The question to be put to the database is, given that I already possess some resources, what is remaining for me to collect to build some combination of items?
If I were to attempt this in SQL I would make 4 tables:
A resource table (the raw materials needed for crafting)
An item table (the things I can craft)
A many to many table, mapping items to items
A many to many table, mapping items to resources
What would you recommend I use? An answer might be, there are no NoSQL databases that lend themselves well to your problem set (model and queries).
Using the Bill of Materials picture I linked to in the comment, you have a Resource table.
Resource
--------
Resource ID
Resource Name
Here are some rows, based on your example. I deliberately added spearhead after spear. The order of the resources doesn't matter.
Resource ID | Resource Name
---------------------------
1 Wood
2 Iron
3 Staff
4 Spear
5 Spearhead
6 Trident
Next, you have a ResourceHiearchy table.
ResourceHiearchy
----------------
ResourceHiearchy ID
Resource ID
Parent Resource ID (FK)
Resource Quantity
Here are some rows, again based on your example.
ResourceHiearchy ID | Resource ID | P Resource ID | Resource Quantity
1 6 null null
2 5 6 3
3 3 6 1
4 2 6 2
5 4 3 1
6 4 5 1
7 5 2 2
8 3 1 5
Admittedly, this is difficult to create by hand. I probably made some errors in my example. You would have a part of your application that allows you to create Resource and ResourceHiearchy rows using the actual resource names.
You have to make several queries to retrieve all of the components for a top-level resource, starting with a null Parent ResourceHiearchy ID and querying your way through the resources. That's the disadvantage to a Bill of Materials.
The advantage of a Bill of Materials is there's no limit to the nesting and you can freely combine items and resources to make more items.
You can identify resources and items with a flag in your Resource table if you wish.
You might want to consider a graph data model, such as JanusGraph, where entitles (nodes) could be members of a set (defined as another node) via a relationship (edge).
That would allow you to have multi-child or multi-parent relationships you are talking about.
Mother == married to == Father
child1, child2, child 3 ... childn
Would all then have a "childOf" relationship to both the mother and separately to the father, and would be "siblingOf" the other members of the set, as labeled along their edges.
Make sense?
Here's more of the types of edge labels and multiplicities you can have:
https://docs.janusgraph.org/basics/schema/
Disclosure: I work for ScyllaDB, and our database is often used as a storage engine under JanusGraph implementations. There are many other types of NoSQL graph databases you can check out. Find the one that's right for your use case & data model.
Edit: JanusGraph is open source, as is Scylla:
https://github.com/JanusGraph
https://github.com/scylladb/scylla

Database architecture - Have 2 seperate columns or 1

Okey, context:
I have a system that requires to do a monthly, weekly and dayly reports.
Architecture A:
3 tables:
1) Monthly reports
2) Weekly reports
3) Daily reports
Architecture B:
1 table:
1) Reports: With extra column report_type, with values: "monthly", "weekly", "daily".
Which one would be more performant and why?
The common method I use to do this is use two tables, following similarly to your B approach. One table would be as you describe with report data and an extra column, but instead of hard coding the values, this column would hold an id to a reference table. The reference table would then hold the names of these values. This set up allows you to easily reference the intervals with other tables should you need that later on, but also makes name updates much more efficient. Changing the name of say "Monthly" to "Month" would require one update here, vs n updates if you stored the string in your report table.
Sample structure:
report_data | interval_id
xxxx | 1
interval_id | name
1 | Monthly
As a side note, you would rarely want to take your first approach, approach A, due to how it limits changing the interval type of entered data. If all of a sudden you want to change half of your Daily entries to Weekly entries, you need to do n/2 deletes and n/2 inserts, which is fairly costly especially if you start introducing indexes. In general tables should describe types of data (ie Reports) and columns should describe that type (ie How often a report happens)

multiple stores (sId), multiple products(pId) different prices. how do I design an efficient database

Right now, I am designing the database, as such I don't have any code. I am looking to use sql server, asp.net if that is relevant.
I have a big number of stores and a big number of products too, both in some thousands. For the same pId, prices may vary by sId. I would build it like this:
1. one "store" table containing fields (sId, name, location),
2. one "products" table containing fields (pId, name size, category, sub-category) and
3. "max(sId)" number of price tables containing fields (pId, mrp, availability).
where max(sId) is the total number of stores.
I would rather not make "max(pId)" number of tables containing fields (sId, mrp, availability) as I need to provide a UI to each store so that they can update the details about product prices and availability at their respective stores. I also need to display some products of a particular store but I never need to display some stores for any specific product. That is, search for stores by product is not required, but listing of products by store would be required.
Is this a good way or can I do better?
You appear to be on the right track and I will offer some recommendations. Although there is no requirement to display some stores for any specific products, you should always think about how the requirements will change and how your system can handle that. Build your system so that you can answer questions like these easily - What stores have product ABC priced under $3/piece?
Store table should contain, as you mentioned, information about stores. Take Aaron Bertrand's comment seriously. Name the fields in a way that the next developer can read and figure out what it is. User StoreID instead of sID.
StoreID StoreName ...other fields
------- --------------
1 North Chicago
2 East Los Angeles
Product table should contain information about products. It would be better to store category and sub-category into a different table.
ProductID ProductName ...other fields
--------- --------------
1 Bread
2 Soap
Categories can be located in its own table with hierarchal structure. See Hierarchal Data and how to use hierarchyid data type. This may help in finding out the depth of each top level category and help management decide if they are going overboard with categorization and making life miserable for everybody, including themselves unknowingly.
Many-to-many ProductCategory table can link products to categories. Also keep a history table. When a product's category is changed, keep track of what it was and what it is set to. It may help in answering questions such as - How many products were moved from Agriculture to Construction category in the last 6 months?
Many-to-many StoreProductPrice can bring together store and product and a price can be defined there. Also remember - prices may differ by customers also. Some customers may get discounts at a certain level. Although this may be too much to discuss here, it should be kept in the back of the mind in case a requirement to support customer discount structure comes up.
StoreProductID StoreID ProductID Price
-------------- ------- --------- -----
1 1 1 $4.00
2 1 2 $1.00
3 2 1 $4.05
4 2 2 $1.02
Availability of the product should be done through the inventory management database table(s). For example, you may have a master table of Warehouse and master table of Location. Bringing them together would be WearhouseLocation table. A WarehouseProduct table may bring together warehouse, product and units available.
Alternatively, your production or procurement facility might be dumping data into ProcuredProduct table. Your manufacturing unit might be putting locks on a subset of products while building something out of it. Your sales unit might be putting locks on a subset of products they are trying to sell. In other words, your products may be continually get allocated. You may run queries to find out availability of a certain product and that can be a little taxing. During any such allocation, the number of available units can be updated in a single table (which contains calculated available products that you can comfortably rely on).
So...depending on your customer's needs, the system you are building can get fairly complicated. I am recommending that you think about these things and keep your database structure flexible to anticipated changes. Normalization is a good thing, and de-normalization has its place also. Use them wisely.

COUNT_BIG in Indexed View

CREATE TABLE test2 (
id INTEGER,
name VARCHAR(10),
family VARCHAR(10),
amount INTEGER)
CREATE VIEW dbo.test2_v WITH SCHEMABINDING
AS
SELECT id, SUM(amount) as amount
-- , COUNT_BIG(*) as tmp
FROM dbo.test2
GROUP BY id
CREATE UNIQUE CLUSTERED INDEX vIdx ON test2_v(id)
I have error with this code:
Cannot create index on view
'test.dbo.test2_v' because its select
list does not include a proper use of
COUNT_BIG. Consider adding
COUNT_BIG(*) to select list.
I can create view like this:
CREATE VIEW dbo.test2_v WITH SCHEMABINDING
AS
SELECT id, SUM(amount) as amount, COUNT_BIG(*) as tmp
FROM dbo.test2
GROUP BY id
But I'm just wondering what is purpose of this column in this case?
You need COUNT_BIG in this case because of the fact you are using GROUP BY.
This is one of many limitations of Indexed Views and because of these restrictions, Indexed Views can't be used in many places or the usage of it is NOT as effective as it could have been. Unfortunately, it is how it works currently. Sucks, it narrows the scope of usage.
http://technet.microsoft.com/en-us/library/cc917715.aspx
Looks like it's simply a hardcoded performance-related restriction that the SQL Server team had to put in place when they first designed aggregate indexed views in SQL Server 2000.
Until relatively recently you could see this in the SQL 2000 technet documentation at http://msdn.microsoft.com/en-us/library/aa902643(SQL.80).aspx, but the SQL Server 2000 documentation has been definitely retired. You can still download a 92MB PDF file, and find the relevant notes on pages 1146 and 2190: https://www.microsoft.com/en-us/download/details.aspx?id=51958
An explanation for this restriction can be found on the SQLAuthority site - actually an excerpt from Itzik Ben-Gan's "Inside SQL" book: http://blog.sqlauthority.com/2010/09/21/sql-server-count-not-allowed-but-count_big-allowed-limitation-of-the-view-5/
It's worth noting that Oracle has the same restriction/requirement, for the same reasons (for an equivalent fast refreshable materialized view); see http://rwijk.blogspot.com.es/2009/06/fast-refreshable-materialized-view.html for a discussion on this topic.
Summary of the explanation:
Why does sql server logically need a materialized global count column in the indexed aggregate view?
So that it can quickly check / know whether a particular row in the aggregate view needs to change or go, when a given row of an underlying table is updated or deleted.
Why does this count column need to be COUNT_BIG(*)?
So that there is no possible risk of overflow; by forcing the use of the bigint datatype, there is no risk of an indexed view "breaking" when a particular row reaches an overly high count.
It's relatively easy to visualize why a count is critical to efficient aggregate view maintenance - imagine the following situation:
The table structures are as specified in the question
There are 4 rows in the underlying table:
ID | name | family | amount
--- | ---- | ------ | ------
1 | a | | 10
2 | b | | 11
2 | c | | 12
3 | d | | 13
The aggregate view is materialized to something like this:
ID | amount | tmp
--- | ------ | ---
1 | 10 | 1
2 | 23 | 2
3 | 13 | 1
Simple Case:
The SQL Engine detects a change in the underlying data - the third row in the source data (id 2, name c) is deleted.
The engine needs to:
find and update the relevant row of the aggregate materialized view
reduce the "amount" sum by the amount of the deleted underlying row
reduce the "count" by 1 (if this column exists)
Target/difficult case:
The SQL Engine detects another change in the underlying data - the second row in the source data (id 2, name b) is deleted.
The engine needs to:
find and delete the relevant row of the aggregate materialized view, as there are no more source rows with the same grouping key
Consider that the engine always has the "before" row of the underlying table(s) at view-update time - it knows exactly what changed in both cases.
The notable "step" in the materialized-view-maintenance algorithm is determining whether the target materialized aggregate row needs to be deleted or not
if you have a "count", you don't need to look anywhere beyond the target row - if you're dropping the count to 0, then delete the row. If you're updating to any other value, then leave the row.
if you don't have a count, then the only way for you to figure it out, would be to query the underlying table to check for any other rows with the same aggregation key; such a process would clearly introduce much more onerous restrictions:
it would be implicitly slower, and
in join-aggregation cases would be un-optimizable!
For these reasons, the existence of a count(*) column is a fundamental requirement of the aggregate materialized view implementation. Without a count(*) column, the real-time maintenance of an aggregate materialized view in the face of underlying data changes would carry an unacceptably high performance penalty!
You could still ask "why doesn't SQL Server create/maintain such a count column for me automatically when I create an aggregate materialized view?" - I don't have a particularly good answer for this. In the end, I imagine there would be more questions and confusion about "Why does my aggregate materialized view have a BIGCOUNT column if I didn't add it?" if they did that, so it's simpler to make it a basic requirement of creation of the object, but that's a purely subjective opinion.
I know this thread is a bit old but for those who still have this question, http://technet.microsoft.com/en-us/library/ms191432%28v=sql.105%29.aspx says this about indexed views
The SELECT statement in the view cannot contain the following Transact-SQL syntax elements:
The AVG, MAX, MIN, STDEV, STDEVP, VAR, or VARP aggregate functions. If AVG(expression) is specified in queries referencing the indexed view, the optimizer can frequently calculate the needed result if the view select list contains SUM(expression) and COUNT_BIG(expression). For example, an indexed view SELECT list cannot contain the expression AVG(column1). If the view SELECT list contains the expressions SUM(column1) and COUNT_BIG(column1), SQL Server can calculate the average for a query that references the view and specifies AVG(column1).

Resources