Dow Jones Index Data Warehouse - Correct Dimensions? - analytics

I want to create a "data warehouse" monitoring the stock market's 30 biggest companies (companies from Dow Jones index). My current concept for the model is the following, and the point is to be able to monitor individual stocks and then connect the gold price or oil price and watch how they correlate. I understand that dimensions are supposed to be descriptive, so those gold and oil dimensions are not really ideal, but I can't really put them in the fact table, because the metrics from the fact table are not contained in gold and oil prices. Same goes pretty much for the economy dimension, which monitors different economy indicators.
[Image: Dow Jones DW]
Does this data warehouse actually make sense, fundamentally? Would it be possible to implement gold/oil (or other) prices to the warehouse for it to make sense?

Related

Designing gumy service relational database

I'm working on a small school project. I'm struggling with connecting the tables of the database (more specifically the schedule: getting all the info and connecting the box, storage, and tires that are mentioned in the contract; I used aggregation there). Here is a picture of my solution: https://ibb.co/eV1D8b . Does it look OK?
The task goes:
The company has several business offices across the country. In addition to tire assembly, it offers several other services: sales (tires and other products), tire maintenance, etc.
To change tires at the beginning of the season, the customer has to book by phone. The operator receives the call and checks the booking records for each box where the tires can be changed. Changing the tires takes 30 minutes.
Upon arrival, the customer goes to the counter, where, based on the registration of the vehicle, a work order is made for the employee in the box (on which the number of tire code contracts was recorded). The tires are then transported from the warehouse to the box for assembly. When the tires are stored, a contract is signed to ensure that after the contracted deadline of 6 months the tires are kept for another 60 days, which is paid annually at $0.10 per tire per day. The price of tire storage depends on the size and is calculated per piece.
There is a possibility of joining the Loyalty Club and getting discounts for all family cars. The discount is valid only for services, not for products.
With companies that want the services of this tire service (rent-a-car, utility vehicles, etc...), a contract will be concluded to provide a more affordable price for tire assembly and storage.

Table Structure for Managing a Collection with Org Table

I have found org tables to be very powerful and useful. I feel like I have movement, table restructuring and basic formulas down fairly well. But I am having a difficult time wrapping my head around how I should structure this for tracking large collections. Not sure if I can do this in one table or if I need multiple tables.
Say I have a business that buys and sells trading cards. There are baseball, basketball and football cards. I want to track purchase price, sale price, purchase date, sale date, average sale price, last sale price, quantity in stock, and item condition for every card sold or in stock.
Is it possible to do this in a single table or do I need multiple tables?
I'd like to track statistics such as:
"What is the average price of all football cards sold in the last six months?"
"In the last month, did I buy more basketball cards or baseball cards?"
And for a more lengthy example:
"Last year I sold 4 Mickey Mantle cards. 2 in Mint condition, 1 in Excellent condition, 1 in Poor condition and 1 unsold. What percentage of Mint Mickey Mantle cards were sold last year?"
To reiterate, in org-mode can all this be accomplished within a single table? How would it be structured if, say, you knew Topps only made 2000 unique cards in a particular year? Would the table only contain 2000 rows (plus the header)?
If it can't be accomplished in a single table, I'm just going to use a postgres database structured much like the one mentioned here. I was really hoping there was a snazzy way to do this with org-table alone. But it looks like there are other ways to manipulate databases within emacs.
Sorry if most of this sounds like a high school math problem with no code but I'm sure most people (at least here) know what a single org table with the mentioned columns and a finite set of rows would look like.
Edit1: Can org references be used to link tables together to help get the results I'm looking for?
Edit2: The reason why I thought this was possible in org-mode, was because I did not think a foreign key was necessary. Here is a very similar example not using a foreign key. When reading about construction of spreadsheets in org-mode, foreign keys seemed to be the only obvious hurdle. Anyone have thoughts on this?

Database Design for a Person's Availability

I am currently working on a web application that stores information of Cooks in the user table. We have a functionality to search the cooks from our web application. If a cook is not available on May 3, 2016, we want to show the Not-Bookable or Not-Available message for that cook if user performs the search for May 3, 2016. The solution we have come up to is to create a table named CooksAvailability with following fields
ID, //Primary key, auto increment
IDCook, //foreign key to user's table
Date, //date he is available on
AvailableForBreakFast, //bool field
AvailableForLunch, //bool field
AvailableForDinner, //bool field
BreakFastCookingPrice, //decimal nullable
LunchCookingPrice, //decimal nullable
DinnerCookingPrice //decimal nullable
With this schema, we are able to tell whether the cook is available on a specific date. The problem with this approach is that it requires a lot of database space: if a cook is available for 280 days/year, there have to be 280 rows to reflect just one cook's availability.
This is too much space given that we may have thousands of cooks registered with our application. As you can see, there are separate CookingPrice fields for breakfast, lunch and dinner, which means a cook can charge different rates for cooking on different dates and at different times.
Currently, we are looking for a smart solution that fulfils our requirements and consumes less space than our solution does.
You are storing a record for each day, and the main mistake that led you to this redundant design is that you did not separate the concepts enough.
I do not know whether a cook has an expected rate for a given meal, that is, a price one can assume in general if one has no additional information. If that is the case, then you can store these default prices in the table where you store the cooks.
Let's store the availability and the specific prices in different tables. If the availability table does not have to store the prices, then you can store availability intervals. In the other table, where you store the prices, you need to store only the prices that deviate from the expected price. So you will have availability intervals in one table, specific prices (only where the price differs from the expected one) in the other, and default meal prices in the cook table; if there is no special price, the default price is used.
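A minimal sketch of that three-table split, using Python's standard sqlite3 module. The table and column names (cook, availability, price_override) and the sample rows are assumptions made up for illustration; per-meal availability is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cook (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    default_breakfast_price REAL,
    default_lunch_price REAL,
    default_dinner_price REAL
);
CREATE TABLE availability (
    id INTEGER PRIMARY KEY,
    cook_id INTEGER NOT NULL REFERENCES cook(id),
    avail_from TEXT NOT NULL,   -- ISO date, inclusive
    avail_to TEXT NOT NULL      -- ISO date, inclusive
);
CREATE TABLE price_override (
    cook_id INTEGER NOT NULL REFERENCES cook(id),
    day TEXT NOT NULL,
    meal TEXT NOT NULL,         -- 'breakfast', 'lunch' or 'dinner'
    price REAL NOT NULL,
    PRIMARY KEY (cook_id, day, meal)
);
""")
# Hypothetical data: one cook, one availability interval, one special price.
conn.execute("INSERT INTO cook VALUES (1, 'Cook A', 10.0, 15.0, 20.0)")
conn.execute(
    "INSERT INTO availability (cook_id, avail_from, avail_to) "
    "VALUES (1, '2016-05-01', '2016-05-31')"
)
conn.execute("INSERT INTO price_override VALUES (1, '2016-05-03', 'dinner', 25.0)")


def dinner_price(cook_id, day):
    """Return the dinner price for the day, or None if the cook is not available."""
    row = conn.execute(
        """
        SELECT COALESCE(p.price, c.default_dinner_price)
        FROM cook c
        JOIN availability a
          ON a.cook_id = c.id AND ? BETWEEN a.avail_from AND a.avail_to
        LEFT JOIN price_override p
          ON p.cook_id = c.id AND p.day = ? AND p.meal = 'dinner'
        WHERE c.id = ?
        """,
        (day, day, cook_id),
    ).fetchone()
    return row[0] if row else None
```

Here one interval row replaces 31 per-day rows, and only the single deviating price needs its own record.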
To answer your question I should know more about the structure of the information.
For example, if most cooks are available for a continuous period, it could be helpful to organize your availability table with avail_from_date and avail_to_date columns instead of a row for each day. This would reduce the number of rows.
The prices for breakfast, lunch and dinner would be better stored in the cooks table if they do not differ from day to day. The same goes for the availability for breakfast, lunch and dinner if it does not differ from day to day.
But if your information structure makes it necessary to keep a record for every cook every day, this would be 365 * 280 = 102,200 records for a year, which is not very much for a SQL database in my eyes. If you put the indexes in the right place, this will perform well.
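As a rough illustration of the from/to idea, the following Python sketch collapses per-day availability into date ranges. The function name to_intervals and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def to_intervals(days):
    """Collapse a collection of dates into sorted (first_day, last_day) ranges."""
    intervals = []
    for day in sorted(set(days)):
        if intervals and day - intervals[-1][1] == timedelta(days=1):
            intervals[-1] = (intervals[-1][0], day)   # extend the current run
        else:
            intervals.append((day, day))              # start a new run
    return intervals


# Hypothetical cook: available 1-5 May and 10-12 May 2016 (8 per-day rows).
available_days = [date(2016, 5, d) for d in (1, 2, 3, 4, 5, 10, 11, 12)]
intervals = to_intervals(available_days)
```

The eight per-day rows become two interval rows; the saving grows with the length of each uninterrupted run.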
There are a few questions that would help with the overall answer.
How often does availability change?
How often does price change?
Are there general patterns, e.g. cook X is available for breakfast and lunch, Monday - Wednesday each week?
Is there a normal availability / price over a certain period of time, but with short-term overrides / differences?
If availability and price change at different speeds, I would suggest you model them separately. That way you only need to show what has changed, rather than duplicating data that is constant.
Beyond that, there's a space / complexity trade-off to make.
At one extreme, you could have a hierarchy of configurations that override each other. So, for cook X there's set A that says they can do breakfast Monday - Wednesday between dates 1 and 2. Then also for cook X there's set B that says they can do lunch on Thursday between dates 3 and 4. Assuming that dates go 1 -> 3 -> 4 -> 2, you can define whether set B overrides set A or adds to it. This is the most concise, but has quite a lot of business logic to work through to interpret it.
At the other extreme, you just say for cook X between date 1 and 2 this thing is true (an availability for a service, a price). You find all things that are true for a given date, possibly bringing in several separate records e.g. a lunch availability for Monday, a lunch price for Monday etc.
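The simpler extreme could look roughly like the Python sketch below: every record says one thing is true between two dates, and a lookup collects everything that holds on a given day. The Fact class, the kind labels, and the sample records are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    cook: str
    kind: str        # e.g. "lunch_available" or "lunch_price"
    value: object
    start: date      # inclusive
    end: date        # inclusive


# Hypothetical records for cook X.
facts = [
    Fact("X", "lunch_available", True, date(2016, 5, 1), date(2016, 5, 31)),
    Fact("X", "lunch_price", 15.0, date(2016, 5, 1), date(2016, 5, 31)),
    Fact("X", "dinner_available", True, date(2016, 6, 1), date(2016, 6, 30)),
]


def facts_on(cook, day):
    """Return {kind: value} for every record of the cook that covers the day."""
    return {
        f.kind: f.value
        for f in facts
        if f.cook == cook and f.start <= day <= f.end
    }
```

This version has no override rules at all, so overlapping records of the same kind would need a tie-break that the sketch does not define.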

database normalization of a table

Let's consider I have the following not normalized table
1) warehouse
id
item_id
residual
purchase cost
sale cost
Currency
I tried to normalize this and I obtained these tables:
1) warehouse table
id
product_id
residual
cost_id
2) costs table
id
purchase cost
sale cost
Currency
Does that comply with database normal forms?
Thanks much in advance!!!
This should be a comment, but it's too verbose.
There's not enough information to provide an answer. We have to infer structure from context, and the context is confusing. Your initial record looks like a description of a product to be bought and sold, but you've named it warehouse, which is a place for storing products. I've no idea what you mean by residual. Do you have multiple purchase costs for a specific product? If so, how are they differentiated? The same goes for sale cost. If there are multiple costs involved, why is the selling price tied to the purchase cost?
I don't know what "residual" means in this context. But just ignoring that ...
I doubt that there's anything to be gained by breaking cost out into a separate table. Let's say we have two products, "toaster model 14" and "men's shirt style X7". Both have a cost of $12. So you create a cost record for $12, and point both records to this. Then you realize that you made a mistake and the toaster really cost $13. So you update the cost record. But that will then update the cost for the shirt also, which is almost surely wrong. Having a separate cost table would mean that you would always create a new cost record every time you created a stock record. Nothing is gained.
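The toaster/shirt problem can be reproduced in a few lines of Python with sqlite3. The table layout below is a simplified assumption (a stock table pointing at a shared costs row), not the asker's exact schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE costs (id INTEGER PRIMARY KEY, purchase_cost REAL);
CREATE TABLE stock (
    id INTEGER PRIMARY KEY,
    product TEXT,
    cost_id INTEGER REFERENCES costs(id)
);
INSERT INTO costs VALUES (1, 12.0);
INSERT INTO stock VALUES (1, 'toaster model 14', 1);
INSERT INTO stock VALUES (2, 'men''s shirt style X7', 1);
""")

# Correct the toaster's cost by editing the cost row it points to.
conn.execute("""
    UPDATE costs SET purchase_cost = 13.0
    WHERE id = (SELECT cost_id FROM stock WHERE product = 'toaster model 14')
""")

# Read back the cost of every product through the shared reference.
costs_by_product = dict(conn.execute(
    "SELECT s.product, c.purchase_cost "
    "FROM stock s JOIN costs c ON c.id = s.cost_id"
))
```

After the update the shirt also reports a cost of 13, because both stock rows share the same cost record.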
The fields you have listed look to me like they all belong in one table. You'd also need an item table that would have data like the description, maybe manufacturer, product specs, etc.
Your warehouse table appears to really be a stocked item table, as it lists items and not warehouses, but whatever. I suspect it also needs some sort of serial number, or how will you link a given physical item in the warehouse to the corresponding record?
If by "sale cost" you mean the price that you will charge to the customer when you sell it, I doubt this belongs in the warehouse table. When a customer buys a product, do you tell him, "I can sell you the one that's in bin 40 in the warehouse for $20 or the one that's in bin 42 for $22. Which do you want?" Probably not. I suspect you charge the same price regardless of which particular unit the customer gets. The fact that the price you have to pay to your supplier went up between when you bought the first one and when you bought the second one normally does not mean that you will charge your customer a different price. You may raise the price, but you will have one price regardless of which unit is sold. Therefore, the selling price goes in the item table, not the warehouse table. If "sale cost" is something else, maybe this whole paragraph is irrelevant.
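One way to picture this split, as a hedged sketch in Python with sqlite3: the selling price lives on the item, while each physical unit keeps its own purchase cost. The names item, stocked_item, and serial_no and the sample values are assumptions, and the currency column is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    description TEXT NOT NULL,
    sale_price REAL NOT NULL          -- one selling price per item
);
CREATE TABLE stocked_item (
    serial_no TEXT PRIMARY KEY,       -- identifies the physical unit
    item_id INTEGER NOT NULL REFERENCES item(id),
    purchase_cost REAL NOT NULL       -- what this particular unit cost
);
INSERT INTO item VALUES (1, 'toaster model 14', 20.0);
INSERT INTO stocked_item VALUES ('SN-001', 1, 12.0);
INSERT INTO stocked_item VALUES ('SN-002', 1, 13.0);
""")

# Each unit has its own purchase cost but inherits the item's selling price.
rows = conn.execute("""
    SELECT s.serial_no, s.purchase_cost, i.sale_price
    FROM stocked_item s JOIN item i ON i.id = s.item_id
    ORDER BY s.serial_no
""").fetchall()
```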

Change Data Capture and SQL Server Analysis Services

I'm designing a database application where data is going to change over time. I want to persist historical data and allow my users to analyze it using SQL Server Analysis Services, but I'm struggling to come up with a database schema that allows this. I've come up with a handful of schemas that could track the changes (including relying on CDC) but then I can't figure out how to turn that schema into a working BISM within SSAS. I've also been able to create a schema that translates nicely into a BISM but then it doesn't have the historical capabilities I'm looking for. Are there any established best practices for doing this sort of thing?
Here's an example of what I'm trying to do:
I have a fact table called Sales which contains monthly sales figures. I also have a regular dimension table called Customers which allows users to look at sales figures broken down by customer. There is a many-to-many relationship between customers and sales representatives, so I can make a reference dimension called Responsibility that refers to the customer dimension and a Sales Representative reference dimension that refers to the Responsibility dimension. I now have the Sales facts linked to Sales Representatives by the chain of reference dimensions Sales -> Customer -> Responsibility -> Sales Representative, which allows me to see sales figures broken down by sales rep. The problem is that the Sales facts aren't the only things that change over time. I also want to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact. I also want to know where the Sales Representative's office was located at the time of a particular Sales fact, which may be different from his current location. I might also want to know the size of a customer's organization at the time of a particular Sales fact, which might also be different from what it is currently. I have no idea how to model this in a BISM-friendly way.
You mentioned that you currently have a fact table which contains monthly sales figures. So one record per customer per month. So each record in this fact table is actually an aggregation of individual sales "transactions" that occurred during the month for the corresponding dimensions.
So in a given month, there could be 5 individual sales transactions for $10 each for customer 123...and each individual sales transaction could be handled by a different Sales Rep (A, B, C, D, E). In the fact table you describe there would be a single record for $50 for customer 123...but how do we model the SalesReps (A-B-C-D-E)?
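The loss of detail can be seen in a small Python sketch. The transaction tuples below are hypothetical and simply mirror the five $10 sales described above.

```python
from collections import defaultdict

# Hypothetical transactions: (month, customer_id, sales_rep, amount)
transactions = [
    ("2012-03", 123, "A", 10),
    ("2012-03", 123, "B", 10),
    ("2012-03", 123, "C", 10),
    ("2012-03", 123, "D", 10),
    ("2012-03", 123, "E", 10),
]

monthly = defaultdict(int)   # the monthly fact: one total per (month, customer)
reps = defaultdict(set)      # the reps that the monthly total can no longer tell apart
for month, customer, rep, amount in transactions:
    monthly[(month, customer)] += amount
    reps[(month, customer)].add(rep)
```

The single monthly row carries a total of 50 but would need to reference five different sales reps, which one SalesRepKey column cannot do.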
Based on your goals...
to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact
to know where the Sales Representative's office was located at the time of a particular sales fact
to know the size of a customer's organization at the time of a particular Sales fact
...I think it would be easier to model at a lower granularity...specifically a sales-transaction fact table which has a grain of 1 record per sales transaction. Each sales transaction would have a single customer and single sales rep.
FactSales
DateKey (date of the sale)
CustomerKey (customer involved in the sale)
SalesRepKey (sales rep involved in the sale)
SalesAmount (amount of the sale)
Now for the historical change tracking...any dimension with attributes for which you want to track historical changes will need to be modeled as a "Slowly Changing Dimension" and will therefore require the use of "Surrogate Keys". So for example, in your customer dimension, Customer ID will not be the primary key...instead it will simply be the business key...and you will use an arbitrary integer as the primary key...this arbitrary key is referred to as a surrogate key.
Here's how I'd model the data for your dimensions...
DimCustomer
CustomerKey (surrogate key, probably generated via IDENTITY function)
CustomerID (business key, what you will find in your source systems)
CustomerName
Location (attribute we wish to track historically)
-- the following columns are necessary to keep track of history
BeginDate
EndDate
CurrentRecord
DimSalesRep
SalesRepKey (surrogate key)
SalesRepID (business key)
SalesRepName
OfficeLocation (attribute we wish to track historically)
-- the following columns are necessary to keep track of historical changes
BeginDate
EndDate
CurrentRecord
FactSales
DateKey (this is your link to a date dimension)
CustomerKey (this is your link to DimCustomer)
SalesRepKey (this is your link to DimSalesRep)
SalesAmount
What this does is allow you to have multiple records for the same customer.
Ex. CustomerID 123 moves from NC to GA on 3/5/2012...
CustomerKey | CustomerID | CustomerName | Location | BeginDate | EndDate | CurrentRecord
1 | 123 | Ted Stevens | North Carolina | 01-01-1900 | 03-05-2012 | 0
2 | 123 | Ted Stevens | Georgia | 03-05-2012 | 01-01-2999 | 1
The same applies with SalesReps or any other dimension in which you want to track the historical changes for some of the attributes.
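A rough sketch of how such a change might be applied, using Python's sqlite3 module rather than SQL Server. The helper change_location is hypothetical, AUTOINCREMENT stands in for the IDENTITY function, and dates are written in ISO format instead of the MM-DD-YYYY layout shown in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (
    CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    CustomerID INTEGER NOT NULL,                    -- business key
    CustomerName TEXT,
    Location TEXT,
    BeginDate TEXT,
    EndDate TEXT,
    CurrentRecord INTEGER
);
INSERT INTO DimCustomer
    (CustomerID, CustomerName, Location, BeginDate, EndDate, CurrentRecord)
VALUES (123, 'Ted Stevens', 'North Carolina', '1900-01-01', '2999-01-01', 1);
""")


def change_location(customer_id, new_location, change_date):
    """Close the current row and insert a new version of the customer."""
    name = conn.execute(
        "SELECT CustomerName FROM DimCustomer "
        "WHERE CustomerID = ? AND CurrentRecord = 1",
        (customer_id,),
    ).fetchone()[0]
    conn.execute(
        "UPDATE DimCustomer SET EndDate = ?, CurrentRecord = 0 "
        "WHERE CustomerID = ? AND CurrentRecord = 1",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO DimCustomer "
        "(CustomerID, CustomerName, Location, BeginDate, EndDate, CurrentRecord) "
        "VALUES (?, ?, ?, ?, '2999-01-01', 1)",
        (customer_id, name, new_location, change_date),
    )


change_location(123, "Georgia", "2012-03-05")
versions = conn.execute(
    "SELECT CustomerKey, Location, BeginDate, EndDate, CurrentRecord "
    "FROM DimCustomer ORDER BY CustomerKey"
).fetchall()
```

New fact rows would then reference CustomerKey 2, while older facts keep pointing at CustomerKey 1.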
So when you slice the sales transaction fact table by CustomerID, CustomerName (or any other non-historically-tracked attribute) you should see a single record with the facts aggregated across all transactions for the customer. And if you instead decide to analyze the sales transactions by CustomerName and Location (the historically tracked attribute), you will see a separate record for each "version" of the customer location corresponding to the sales amount while the customer was in that location.
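The two ways of slicing can be tried out with a small Python/sqlite3 sketch. The sales rows and amounts are made up for illustration, and the history columns are dropped from the dimension to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (
    CustomerKey INTEGER PRIMARY KEY,
    CustomerID INTEGER,
    CustomerName TEXT,
    Location TEXT
);
CREATE TABLE FactSales (DateKey TEXT, CustomerKey INTEGER, SalesAmount REAL);
INSERT INTO DimCustomer VALUES (1, 123, 'Ted Stevens', 'North Carolina');
INSERT INTO DimCustomer VALUES (2, 123, 'Ted Stevens', 'Georgia');
-- Hypothetical sales: two before the move, one after.
INSERT INTO FactSales VALUES ('2012-01-10', 1, 10.0);
INSERT INTO FactSales VALUES ('2012-02-10', 1, 10.0);
INSERT INTO FactSales VALUES ('2012-04-10', 2, 30.0);
""")

# Slice by the business key: both versions of the customer roll up together.
by_customer = conn.execute("""
    SELECT d.CustomerID, SUM(f.SalesAmount)
    FROM FactSales f JOIN DimCustomer d ON d.CustomerKey = f.CustomerKey
    GROUP BY d.CustomerID
""").fetchall()

# Slice by the historically tracked attribute: one row per location version.
by_location = conn.execute("""
    SELECT d.Location, SUM(f.SalesAmount)
    FROM FactSales f JOIN DimCustomer d ON d.CustomerKey = f.CustomerKey
    GROUP BY d.Location
    ORDER BY d.Location
""").fetchall()
```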
By the way, if you have some time and are interested in learning more, I highly recommend the Kimball bible "The Data Warehouse Toolkit"...which should provide a solid foundation on dimensional modeling scenarios.
The established best practices way of doing what you want is a dimensional model with slowly changing dimensions. Sales reps are frequently used to describe the usefulness of SCDs. For example, sales managers with bonuses tied to the performance of their teams don't want their totals to go down if a rep transfers to a new territory. SCDs are perfect for tracking this sort of thing (and the situations you describe) and allow you to see what things looked like at any point historically.
Spend some time on Ralph Kimball's website to get started. The first 3 articles I'd recommend you read are Slowly Changing Dimensions, Slowly Changing Dimensions Part 2, and The 10 Essential Rules of Dimensional Modeling.
Here are a few things to focus on in order to be successful:
You are not designing a 3NF transactional database. Get comfortable with denormalization.
Make sure you understand what grain means and explicitly define the grain of your database.
Do not use natural keys as keys, and do not bake any intelligence into your surrogate keys (with the exception of your time keys).
The goals of your application should be query speed and ease of understanding and navigation.
Understand type 1 and type 2 slowly changing dimensions and know where to use them.
Make sure you have a sponsor on the business side with the power to "break ties". You will find different people in the organization with different definitions of the same thing, and you need an enforcer with the power to make decisions. To see what I mean, ask 5 different people in your organization to define "customer" or "gross profit". You'll be lucky to get 2 people to define either the same way.
Don't try to wing it. Read The Data Warehouse Lifecycle Toolkit and embrace the ideas, even if they seem strange at first. They work.
OLAP is powerful and can be life changing if implemented skillfully. It can be an absolute nightmare if it isn't.
Have fun!
