Building data warehouse dimension - database

I'm trying to create a data warehouse from which we will create all business reports. I have already learned quite a lot about this and have a general idea of how to build the data warehouse. However, I ran into a problem when I started to wonder how to combine product and sales information from two separate OLTP databases into a single data store.
The ETL process looks like this:
1 Transfer product data from the first OLTP database into the stgProducts table.
2 Merge product data from stgProducts into dimProducts: records for changed products are updated, and new products are added as new records.
3 Transfer product data from the second OLTP database into the stgProducts table.
4 Merge product data from stgProducts into dimProducts: records for changed products are updated, and new products are added as new records.
Sales data is transferred in the same way.
If I have one table with products, how do I connect it to the sales data from two different databases?
By two databases, I mean two different ERP systems. One manages online sales, the other handles all other sales. The SKU of a product is the same in both, but the product ID is different in each system.

Assuming that you're using a star schema (not always the best approach, BTW), your product dimension table should have a key that is unique to the DW. Thus, SKU 186 might have a DW-specific key of 1 and SKU 294 might have a DW-specific key of 2. Your fact table, which holds 'transaction records' (sales records?), will have a compound key composed of multiple foreign key columns (e.g. product_key, date_key, location_key, etc.).
The foreign key to the product table in this case points to the DW-specific product key, NOT to the source system SKU.
The ETL to populate your fact table should 'translate' the source system product key into the DW-specific product key as you bring the data in.
NOTE: This is the general approach to populating fact tables. There can be variations based on specific requirements.
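A minimal sketch of that key translation, using Python's built-in sqlite3 module so it can be run as-is. The table and column names (dim_product, stg_sales, fact_sales, source_system) and the sample rows are made up for illustration; only the SKU values 186 and 294 come from the text above. It assumes, as the question states, that the SKU is shared by both source systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,   -- DW-specific surrogate key
    sku         TEXT NOT NULL UNIQUE   -- assumed identical in both ERP systems
);
CREATE TABLE stg_sales (
    source_system TEXT,   -- hypothetical label for the originating ERP
    sku           TEXT,
    sale_date     TEXT,
    amount        REAL
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_date   TEXT,
    amount      REAL
);
INSERT INTO dim_product (product_key, sku) VALUES (1, '186'), (2, '294');
INSERT INTO stg_sales VALUES
    ('online', '186', '2024-01-05', 10.0),
    ('retail', '186', '2024-01-05', 12.5),
    ('retail', '294', '2024-01-06', 7.0);
""")

# Translate the source SKU into the DW product key while loading the fact table.
conn.execute("""
INSERT INTO fact_sales (product_key, sale_date, amount)
SELECT d.product_key, s.sale_date, s.amount
FROM stg_sales AS s
JOIN dim_product AS d ON d.sku = s.sku
""")

fact_rows = conn.execute(
    "SELECT product_key, SUM(amount) FROM fact_sales "
    "GROUP BY product_key ORDER BY product_key"
).fetchall()
print(fact_rows)
```

Sales from both systems end up under the same product_key, which is what lets one product dimension serve both sources.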

Expanding on Ben's answer a bit, with the caveat that there is no single right answer in data warehousing: it is an expansive, mungeable area of IT with lots of schools of thought. Here is one possible direction you can pursue.
Assumptions:
1) You have 2 separate source databases, both with 2 tables: Product and Sales
2) The separate source databases are isolated and may have conflicting primary key data.
3) You want to version[1] both the product and the sales tables. This is an important assumption, as fact tables are mostly not updated and the sales table sounds like it belongs as a nice static fact table. Your question is unclear on whether you expect changes to sales, so I will assume you do.
4) A sales record can only ever be for 1 product (this is unlikely, but your question only mentions the 2 tables so I'll work from there; a 1-to-many relation will involve more tweaking around the bridging table)
Warehouse Design:
You will need 3 tables with the following columns:
PRODUCT_DIM
- PRODUCT_SK (Surrogate primary key, generated by the data warehouse database)
- SOURCE_SYSTEM_ID (DW specific indicator as to which source OLTP database the record was sourced from - can be string if you like)
- PRODUCT_NK (PK of product from source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- PRODUCT_NAME (Product name from source table)
- Other Columns (Any other product columns you may need)
SALES_DIM
- SALES_SK (Surrogate primary key, generated by the data warehouse database)
- SOURCE_SYSTEM_ID (DW specific indicator as to which source OLTP database the record was sourced from - can be string if you like)
- SALES_NK (PK of sales record from source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- SALE_AMOUNT (Sale amount from source table)
- Other Columns (Any other sales columns you may need)
PRODUCT_SALES_BRIDGE
- PRODUCT_SK (composite primary key)
- SALES_SK (composite primary key)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
The main thing to note is the identifiers in the SALES and PRODUCT dim tables.
There is a Natural Key column for storing each record's Primary Key value exactly as it is in the source system.
Because you have stated you have multiple source systems, the additional SOURCE_SYSTEM_ID column is required so you can match records from your multiple source systems to their equivalent record in your warehouse. Otherwise you might have a product of EGGS with an ID of 13 in your first source system and a product called MILK with an ID also of 13 in your second system. Without the additional SOURCE_SYSTEM_ID you will be forever cutting records for PRODUCT_DIM natural key 13. This will look something like this in your warehouse:
PRODUCT_SK SOURCE_SYSTEM_ID PRODUCT_NK .. PRODUCT_NAME
..
14 1 13 .. EGGS
15 2 13 .. MILK
..
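As a rough sketch of how an ETL step might resolve the right warehouse row, here is a lookup on the pair (SOURCE_SYSTEM_ID, PRODUCT_NK) using Python's sqlite3 module. The helper name lookup_product_sk and the DATE_FROM values are invented for the example; the EGGS and MILK rows mirror the sample above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_sk       INTEGER PRIMARY KEY,
    source_system_id INTEGER NOT NULL,
    product_nk       INTEGER NOT NULL,
    date_from        TEXT NOT NULL,     -- sample dates, made up
    date_to          TEXT,              -- NULL marks the current version
    product_name     TEXT
);
INSERT INTO product_dim VALUES
    (14, 1, 13, '2024-01-01', NULL, 'EGGS'),
    (15, 2, 13, '2024-01-01', NULL, 'MILK');
""")

def lookup_product_sk(conn, source_system_id, product_nk):
    """Return the surrogate key of the current row, or None if there is none."""
    row = conn.execute(
        """SELECT product_sk FROM product_dim
           WHERE source_system_id = ? AND product_nk = ? AND date_to IS NULL""",
        (source_system_id, product_nk),
    ).fetchone()
    return row[0] if row else None

eggs_sk = lookup_product_sk(conn, 1, 13)   # natural key 13 in system 1
milk_sk = lookup_product_sk(conn, 2, 13)   # natural key 13 in system 2
print(eggs_sk, milk_sk)
```

The same natural key 13 resolves to two different surrogate keys because the source system is part of the lookup.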
The bridge table exists to prevent cutting new SALES or PRODUCT records each time their related record changes. Consider a sale of $10 worth of Red Eggs. The next day, the Red Eggs product is renamed to "Super Red Eggs". This will result in a new PRODUCT record for Red Eggs in the warehouse. If the SALES table included a direct link to PRODUCT_SK, a new SALES record would have to be cut solely because there was a new product SK for our Red Eggs. The bridging table moves the referential-integrity-induced cutting of a new record from the DIMENSION/FACT table into the bridge table. This also has the added benefit of making newcomers to the data warehouse very aware that they are operating in a different mindset from the traditional RDBMS.
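To make the Red Eggs scenario concrete, here is one possible sketch in Python with sqlite3. It assumes a Type 2 style change where the old row is closed by setting DATE_TO; the function name rename_product, the natural key values, and the dates are hypothetical. It is only one way to wire the three tables together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_sk INTEGER PRIMARY KEY AUTOINCREMENT,
    source_system_id INTEGER, product_nk INTEGER,
    date_from TEXT, date_to TEXT, product_name TEXT
);
CREATE TABLE sales_dim (
    sales_sk INTEGER PRIMARY KEY AUTOINCREMENT,
    source_system_id INTEGER, sales_nk INTEGER,
    date_from TEXT, date_to TEXT, sale_amount REAL
);
CREATE TABLE product_sales_bridge (
    product_sk INTEGER, sales_sk INTEGER,
    date_from TEXT, date_to TEXT,
    PRIMARY KEY (product_sk, sales_sk)
);
INSERT INTO product_dim VALUES (1, 1, 13, '2024-01-01', NULL, 'Red Eggs');
INSERT INTO sales_dim VALUES (1, 1, 500, '2024-01-01', NULL, 10.0);
INSERT INTO product_sales_bridge VALUES (1, 1, '2024-01-01', NULL);
""")

def rename_product(conn, source_system_id, product_nk, new_name, as_of):
    """Close the current product row, add a new version, re-point the bridge."""
    old_sk = conn.execute(
        "SELECT product_sk FROM product_dim "
        "WHERE source_system_id = ? AND product_nk = ? AND date_to IS NULL",
        (source_system_id, product_nk)).fetchone()[0]
    conn.execute("UPDATE product_dim SET date_to = ? WHERE product_sk = ?",
                 (as_of, old_sk))
    new_sk = conn.execute(
        "INSERT INTO product_dim "
        "(source_system_id, product_nk, date_from, date_to, product_name) "
        "VALUES (?, ?, ?, NULL, ?)",
        (source_system_id, product_nk, as_of, new_name)).lastrowid
    # Only the bridge gets new rows; sales_dim is left untouched.
    current_links = conn.execute(
        "SELECT sales_sk FROM product_sales_bridge "
        "WHERE product_sk = ? AND date_to IS NULL", (old_sk,)).fetchall()
    conn.execute(
        "UPDATE product_sales_bridge SET date_to = ? "
        "WHERE product_sk = ? AND date_to IS NULL", (as_of, old_sk))
    conn.executemany(
        "INSERT INTO product_sales_bridge VALUES (?, ?, ?, NULL)",
        [(new_sk, sales_sk, as_of) for (sales_sk,) in current_links])
    return new_sk

new_sk = rename_product(conn, 1, 13, "Super Red Eggs", "2024-01-02")
sales_count = conn.execute("SELECT COUNT(*) FROM sales_dim").fetchone()[0]
bridge_rows = conn.execute(
    "SELECT product_sk, sales_sk, date_to FROM product_sales_bridge "
    "ORDER BY product_sk").fetchall()
print(new_sk, sales_count, bridge_rows)
```

After the rename there are two product versions and two bridge rows, but still a single sales row.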
The 2 Natural Key columns should help you solve your original question; the bridge table is just personal preference and added for completeness. If you have a DW design already that works for you, stick with it.
[1] I'll use "version" to mean whatever slowly changing dimension methodology you choose. Most people cheap out and just Type2 their entire tables 'just in case'

Related

Quick Books Data Layout in DB in SQL Server

I am new to QuickBooks. I am working on a staging SQL Server 2017 (v14) database for grocery store data.
The QuickBooks data was uploaded to the server, and many tables are empty.
The data layout is as described at: https://doc.qodbc.com/qodbc/usa
I am looking to understand the data structure, to be able to find the purchasing amount of inventory, grouped by department per week.
The data is grocery store data. QuickBooks has payroll data tables, and I am able to make sense of this payroll data.
But I am unable to find the purchasing data: I do not see how the items can be grouped (class?), where the date field is (TxnDate?), or how to summarize by week.
There are some reports in QuickBooks that can be brought into Excel; should I use those? Any pointers on which one?
I am not able to understand the column names ListID (there are a lot of these; maybe descriptors?), TxnID, and TxnLineID.
Any pointers on how the inventory purchasing data is filed and kept in QuickBooks will help a lot.
https://support.flexquarters.com/esupport/index.php?/Knowledgebase/Article/View/2369/0/how-to-use-the-quickbooks-reporting-engine-with-qodbc
QuickBooks has two types of data, Lists and Transactions (Txn). The ListID is the primary key for a list table, and the TxnID is the primary key for a transaction table. If a transaction has line items (like a Bill), each line has its own TxnLineID.
Inventory can be purchased (or returned) through four transactions: Bill, Check, CreditCardCharge, or VendorCredit (for returning inventory to vendor).
The Bill/Check/CC/VC tables also have corresponding line item tables, as these transactions can have more than one item purchased at a time. These have ItemLine after the parent table name, e.g. BillItemLine. Each of these lines has an Item reference back to the ItemList table to know what item was purchased. The IDs that QuickBooks uses (like 4651C-1355327815) are its own internally generated IDs, but they function just like primary keys, and the columns in other tables that reference them (like ItemLineItemRefListID) are the foreign keys to those tables.
https://doc.qodbc.com/qodbc/usa/TableList.php?categoryName=Purchases shows all the purchasing transactions, but you only need to look at the ones that have ItemLines. Other purchasing transactions, like PurchaseOrders, do NOT affect inventory quantities in QuickBooks. Only Bill, Check, CreditCardCharge, and VendorCredit have an effect.
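To illustrate the weekly summary, here is a small sketch that runs against an in-memory SQLite stand-in for BillItemLine rather than a real QuickBooks export. BillItemLine, TxnDate, and ItemLineItemRefListID are named in this thread; the ItemLineAmount column, the sample IDs, and the amounts are assumptions that should be checked against the QODBC table list. Grouping by department is left out because the thread does not say which column carries it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stand-in for the staged BillItemLine table (columns partly assumed).
CREATE TABLE BillItemLine (
    TxnID                 TEXT,
    TxnDate               TEXT,   -- ISO date of the parent Bill
    ItemLineItemRefListID TEXT,   -- foreign key to the item list
    ItemLineAmount        REAL    -- assumed column name for the line amount
);
INSERT INTO BillItemLine VALUES
    ('T1', '2024-01-01', 'ITEM-A', 100.0),
    ('T1', '2024-01-01', 'ITEM-B',  50.0),
    ('T2', '2024-01-03', 'ITEM-A',  25.0),
    ('T3', '2024-01-08', 'ITEM-A',  40.0);
""")

# strftime('%Y-%W', ...) is SQLite's year-week label; on SQL Server a
# DATEPART-based expression would play the same role.
weekly = conn.execute("""
SELECT strftime('%Y-%W', TxnDate) AS week,
       ItemLineItemRefListID      AS item,
       SUM(ItemLineAmount)        AS purchased
FROM BillItemLine
GROUP BY week, item
ORDER BY week, item
""").fetchall()

for week, item, purchased in weekly:
    print(week, item, purchased)
```

A fuller version would presumably combine the Check, CreditCardCharge, and VendorCredit item line tables in the same way, with vendor credits counted as negative amounts.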

What datatype should be used to store an array in the SQL Server

I am trying to create a table in my database using Visual Studio.
I've got a table for my Products (like in online shop) and then I have a table for Orders, which should store all products that user has ordered. The problem is that I am not sure which datatype I should use when designing the database to store an array of products in my Orders table. This is what the Orders table should look like
You should create Products and Orders tables with a relationship between them.
Your Orders table should have an Id column as well (which is the primary key).
Then you should create a Products table that keeps all the information about products and additionally an OrderId column, which should be used as a foreign key to the Orders table.
Please look at this link:
https://msdn.microsoft.com/en-us/library/ms189049.aspx
It's also worth checking:
One To One, One To Many, Many To Many relations in SQL Server, to better understand them and design your data store properly.
In your case you need a ProductsOrders table, i.e. a Many To Many relationship.
In a relational database, you can create a relationship between 2 tables.
The relationship can be
1 to 1 (1 Product - 1 Order)
1 to Many (1 Product - 'n' Orders)
Many to Many ('n' Products - 'n' Orders)
Based on your scenario, you can choose any of the relationships listed above. While querying the database, you can easily operate over each Order/Product.
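For the many-to-many case, a ProductsOrders table could look roughly like the sketch below, written in Python with sqlite3 so it runs anywhere; the SQL itself is close to what SQL Server would accept. The Quantity and CustomerName columns and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE Products (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Orders   (Id INTEGER PRIMARY KEY, CustomerName TEXT NOT NULL);
CREATE TABLE ProductsOrders (
    OrderId   INTEGER NOT NULL REFERENCES Orders(Id),
    ProductId INTEGER NOT NULL REFERENCES Products(Id),
    Quantity  INTEGER NOT NULL DEFAULT 1,   -- hypothetical extra column
    PRIMARY KEY (OrderId, ProductId)
);
INSERT INTO Products VALUES (1, 'Keyboard'), (2, 'Mouse');
INSERT INTO Orders   VALUES (1, 'alice'), (2, 'bob');
INSERT INTO ProductsOrders VALUES (1, 1, 1), (1, 2, 2), (2, 2, 1);
""")

# "The array of products" for order 1 is just a query over the junction table.
order_items = conn.execute("""
SELECT p.Name, po.Quantity
FROM ProductsOrders AS po
JOIN Products AS p ON p.Id = po.ProductId
WHERE po.OrderId = ?
ORDER BY p.Id
""", (1,)).fetchall()

# The reverse direction works too: every order that contains the Mouse.
orders_with_mouse = [row[0] for row in conn.execute(
    "SELECT OrderId FROM ProductsOrders WHERE ProductId = ? ORDER BY OrderId", (2,))]
print(order_items, orders_with_mouse)
```

No array column is needed: each product in an order is one row in ProductsOrders.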

How do I create a table in SQL Server that stores multiple values for one cell?

Suppose I have a table for purchase orders. One customer might buy many products. I need to store all these products and their relevant prices in a single record, such as an invoice format.
If you can change the db design, prefer to create another table called PO_products that has PO_Id as a foreign key to the PurchaseOrder table. This would be more flexible and is the right design for your requirement.
If for some reason you are hard pressed to store everything in a single cell (which, I reiterate, is not a good design), you can make use of the xml data type and store all of the product information as XML.
Note: Besides being bad design, there is a significant performance cost to storing the data as XML.
This is a typical example of an n-n relationship between customers and products.
Let's say 1 customer can have from 0 to N products and 1 product can be bought by 0 to N customers. You want to use a junction table to store every purchase order.
This junction table may contain the id of the purchase, the id of the customer and the id of the product.
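Here is a small sketch of the separate PO_products table, using Python's sqlite3 module. PO_products, PO_Id, and PurchaseOrder come from the answer above; the other columns (line_no, product_name, quantity, unit_price) and the sample data are assumptions. The "invoice format" then becomes a query result rather than a single cell.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PurchaseOrder (PO_Id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
CREATE TABLE PO_products (
    PO_Id        INTEGER NOT NULL REFERENCES PurchaseOrder(PO_Id),
    line_no      INTEGER NOT NULL,      -- hypothetical line number
    product_name TEXT    NOT NULL,
    quantity     INTEGER NOT NULL,
    unit_price   REAL    NOT NULL,      -- price as charged on this order
    PRIMARY KEY (PO_Id, line_no)
);
INSERT INTO PurchaseOrder VALUES (1, 'Alice');
INSERT INTO PO_products VALUES
    (1, 1, 'Pen',      3, 1.5),
    (1, 2, 'Notebook', 2, 4.0);
""")

# One row per product on the order, plus a computed line total.
invoice_lines = conn.execute("""
SELECT product_name, quantity, unit_price, quantity * unit_price AS line_total
FROM PO_products
WHERE PO_Id = ?
ORDER BY line_no
""", (1,)).fetchall()

invoice_total = sum(line[3] for line in invoice_lines)
print(invoice_lines, invoice_total)
```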
https://en.wikipedia.org/wiki/Many-to-many_(data_model)

Database table normalization (Many relations)

I have the following tables :
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, warehouse_name,........}
WarehousePartitions { partition_id, partition_name,........}
WarehouseRacks { rack_id, rack_name,.......}
Now, in order to track an item's location, I have the following table (a relation).
itemLocation { item_id, division_id, warehouse_id, partition_id, rack_id, floor_number}
It accurately tracks an item's location, but in order to get an item's location info, I have to join five tables, which can cause performance issues.
Also, the table doesn't have a primary key unless we use all of the fields together. Will this cause any issues, and is there a better way to accomplish this?
Thanks.
Think in terms of relationships, since you're putting information in a relational database.
Here are my relationship guesses. Feel free to correct them.
A Division has one or more Warehouses.
A Warehouse has one or more Warehouse partitions.
A Warehouse partition has one or more Warehouse Racks.
A Warehouse rack has one or more items.
.
An item is located in a Warehouse rack.
A Warehouse rack is located in a Warehouse partition.
A Warehouse partition is located in a Warehouse.
A Warehouse is located in a Division.
I hope this helps with your database design.
Edited to add: I'll lay out the indexes for the tables. You should be able to create the rest of the columns.
Division
--------
Division ID
...
Warehouse
---------
Warehouse ID
Division ID
...
Warehouse Partition
-------------------
Warehouse Partition ID
Warehouse ID
...
Warehouse Rack
--------------
Warehouse Rack ID
Warehouse Partition ID
...
Item
----
Item ID
Warehouse Rack ID
Item Type ID
...
Item Type
---------
Item Type ID
Item name
Item Price
Each table has a primary ID blind key, probably an auto incrementing integer or an auto incrementing long.
All of the tables except Division have a foreign key that points back to the parent table.
A row in the Item table represents one item. One item can only be in one Warehouse Rack.
Modern relational databases should have no performance problems joining five tables. I've seen 30 table joins.
Build your system, and solve the actual performance problems that come up, rather than spending any time worrying about hypothetical performance problems.
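As an illustration of the parent-pointing layout above, here is a sketch in Python with sqlite3 that builds the six tables and walks from one item up to its division. Column names follow the outline loosely and the single sample row per table is invented. It shows the shape of the join only; it does not measure performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Division (division_id INTEGER PRIMARY KEY, division_name TEXT);
CREATE TABLE Warehouse (
    warehouse_id   INTEGER PRIMARY KEY,
    division_id    INTEGER NOT NULL REFERENCES Division(division_id),
    warehouse_name TEXT);
CREATE TABLE WarehousePartition (
    partition_id   INTEGER PRIMARY KEY,
    warehouse_id   INTEGER NOT NULL REFERENCES Warehouse(warehouse_id),
    partition_name TEXT);
CREATE TABLE WarehouseRack (
    rack_id      INTEGER PRIMARY KEY,
    partition_id INTEGER NOT NULL REFERENCES WarehousePartition(partition_id),
    rack_name    TEXT);
CREATE TABLE ItemType (
    item_type_id INTEGER PRIMARY KEY, item_name TEXT, item_price REAL);
CREATE TABLE Item (
    item_id      INTEGER PRIMARY KEY,
    rack_id      INTEGER NOT NULL REFERENCES WarehouseRack(rack_id),
    item_type_id INTEGER NOT NULL REFERENCES ItemType(item_type_id));

INSERT INTO Division VALUES (1, 'North');
INSERT INTO Warehouse VALUES (10, 1, 'W-10');
INSERT INTO WarehousePartition VALUES (100, 10, 'P-A');
INSERT INTO WarehouseRack VALUES (1000, 100, 'R-7');
INSERT INTO ItemType VALUES (5, 'Widget', 2.5);
INSERT INTO Item VALUES (1, 1000, 5);
""")

# Each join follows one foreign key to the parent table.
location = conn.execute("""
SELECT t.item_name, r.rack_name, p.partition_name,
       w.warehouse_name, d.division_name
FROM Item AS i
JOIN ItemType           AS t ON t.item_type_id = i.item_type_id
JOIN WarehouseRack      AS r ON r.rack_id      = i.rack_id
JOIN WarehousePartition AS p ON p.partition_id = r.partition_id
JOIN Warehouse          AS w ON w.warehouse_id = p.warehouse_id
JOIN Division           AS d ON d.division_id  = w.division_id
WHERE i.item_id = ?
""", (1,)).fetchone()
print(location)
```

If only the rack is needed, the query can stop after the WarehouseRack join.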
As Gilbert Le Blanc writes, you probably don't need to join to five tables - you may only need to join to "WarehouseRacks".
However, you write that you need to "keep track of" - this suggests that there's a time aspect involved.
That gives you the following schema:
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, division_id, warehouse_name,........}
WarehousePartitions { partition_id, warehouse_id, partition_name,........}
WarehouseRacks { rack_id, partition_id, rack_name,.......}
ItemLocation { rack_id, item_id, entry_time, quantity, floor_number }
In ItemLocation, the first 3 columns (rack_id, item_id, entry_time) form a composite primary key - you're effectively saying "there can only be one instance of an item in a given place at any one time".
You still have to join to five tables to retrieve an item ID (at least if you want the addresses and names and such). Assuming you have modern hardware and database software, this should be fine - unless you're working with huge amounts of data, a 5-way join on a foreign/primary key relationship is unlikely to cause performance issues. Given the quantities you mention in the comment, and the fact you'll be running this on MySQL, I don't think you need to worry about the number of joins.
The benefit of this model is that you simply cannot insert invalid data into the item location table - you can't say that the item is in a rack which doesn't exist in the partition, or a warehouse that doesn't exist in the division; if a warehouse changes division, you don't have to update all the item_location records.
I've created a SQLFiddle to show how it might work.
The "item_location" table is the biggest concern in this - you have to choose whether to store a snapshot (which is what this design does), or a transaction table. With "snapshot" views, your code always updates the "quantity" column, effectively saying "as of entry_time, there are x items in this floor in this rack".
The "transaction" model allows you to insert multiple records - typically positive quantities when adding items, and negative quantities when removing them. The items in that location at any point in time are the SUM of those quantities up to the desired time.
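A minimal sketch of the "transaction" variant, in Python with sqlite3. The table name item_location_txn, the helper quantity_at, and the sample movements are hypothetical; the idea of summing signed quantities up to a point in time is the one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_location_txn (
    rack_id    INTEGER NOT NULL,
    item_id    INTEGER NOT NULL,
    entry_time TEXT    NOT NULL,   -- ISO timestamps compare correctly as text
    quantity   INTEGER NOT NULL    -- positive = added, negative = removed
);
INSERT INTO item_location_txn VALUES
    (1, 100, '2024-01-01 09:00:00',  20),
    (1, 100, '2024-01-02 14:00:00',  -5),
    (1, 100, '2024-01-03 10:00:00',  10);
""")

def quantity_at(conn, rack_id, item_id, as_of):
    """Sum all movements for the item in the rack up to and including as_of."""
    row = conn.execute(
        """SELECT COALESCE(SUM(quantity), 0) FROM item_location_txn
           WHERE rack_id = ? AND item_id = ? AND entry_time <= ?""",
        (rack_id, item_id, as_of),
    ).fetchone()
    return row[0]

on_jan_2 = quantity_at(conn, 1, 100, "2024-01-02 23:59:59")
on_jan_3 = quantity_at(conn, 1, 100, "2024-01-03 12:00:00")
print(on_jan_2, on_jan_3)
```

The trade-off is that history is preserved for free, while reading the current stock needs an aggregate instead of a single row lookup.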

Basic table structure question

I have Brand and Company. 1 Company can have 1 or more Brands.
As an example, Company has company_id and company_name. Similarly, Brands has brand_id and brand_name. Now, can I just add the FK column company_id to Brands so that the relationship is complete with 2 tables, or do I need a 3rd table like Company_Brands, which will have company_id, brand_id and the default PK?
I am not asking for the ideal textbook way to do this, but for what works in a high-transaction environment where performance is important, so less query strain is better. Writes will be high and data in the tables will change often: this is a user content site, so information may not be accurate and is edited constantly.
Just add the foreign key company_id to the brands table. You have described a 1 to many relationship i.e. 1 company can have many brands, but 1 brand cannot have many companies.
You would only need the junction table if you had a many to many relationship.
