I have overheard discussion like "Should this be put into a ledger instead of an update?" I have a feeling that it has to do with record keeping, but I cannot fully understand what a ledger is. Searching both Stack Overflow and Google only turns up accounting-related articles.
So my question is, what is a ledger when talking about database applications?
A ledger usually refers to the collection of states through which an entity has passed. The difference between an update and storing the data in a ledger is that with an in-place update you have no history of the changes made to a given entity.
The most common example of a ledger is indeed a banking model. You can see the difference in the example below:
With updates, every time a client withdraws or deposits money, you just update the client's balance:
user_id | amount
-----------------------------
26KRZT  | 45
With a ledger, you keep the entire history of transactions (and can compute the balance from the client's transactions):
user_id | operation | amount
----------------------------------------------------
26KRZT  | DEPOSIT   | 25
26KRZT  | DEPOSIT   | 35
26KRZT  | WITHDRAW  | 15
Basically, a ledger stores data in the database as diffs (updates to the previous version of an entity) in order to be able to get a change history for a given entity.
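To make the distinction concrete, here is a minimal SQL sketch of both approaches; the table and column names are only an illustration, not a prescribed schema:

-- Update-in-place: one row per user, history is lost on every UPDATE.
CREATE TABLE account_balance (
    user_id  VARCHAR(10) PRIMARY KEY,
    amount   DECIMAL(12,2) NOT NULL
);

-- Ledger: append-only, one row per operation; the balance is derived.
CREATE TABLE account_ledger (
    entry_id   INTEGER PRIMARY KEY,
    user_id    VARCHAR(10) NOT NULL,
    operation  VARCHAR(10) NOT NULL,    -- 'DEPOSIT' or 'WITHDRAW'
    amount     DECIMAL(12,2) NOT NULL,
    created_at TIMESTAMP NOT NULL
);

-- The current balance is computed from the history instead of stored.
SELECT SUM(CASE WHEN operation = 'DEPOSIT' THEN amount ELSE -amount END) AS balance
FROM account_ledger
WHERE user_id = '26KRZT';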
What do people do if there is non-unique data on files which can and should be joined?
My example is for customer data. One file would track the start of an interaction and how long it took from a system perspective. The other file would keep track of the interaction when the employee logs it - this is typically done at the end of the interaction but there can be delays. So there is no way to match up the timestamps between File 1 and File 2. I would want to determine the Duration and Rating for specific Issue types across 3 files.
I typically create an index (in pandas) which is Date | CustomerID | EmployeeID, which works decently most of the time (that customer interacted with that employee on that date). But sometimes the same customer interacts with the same employee on the same day, so I have a duplicate value. This didn't bother me until I noticed my joins (pd.merge) produced duplicate data and, by chance, an outlier interaction was duplicated, which threw off some analysis.
Should I completely drop any interaction with duplicates? Or should I create a more unique ID based on some kind of time interval (e.g. if the EndDatetime in File 1 is within X minutes of the Datetime in another file, which is normally close to the end of the interaction)?
File 1:
StartDatetime | CustomerID | EmployeeID | Duration | EndDatetime
File 2:
Datetime | CustomerID | EmployeeID | Issue
File 3:
Datetime | CustomerID | EmployeeID | Rating
I believe the correct answer to this question depends more on the use cases for your data than anything else. Personally, I deal with interaction data a lot, and in those cases I prefer indexing by interaction time as well, since the two interactions are truly unique. However, if the analysis I'm performing doesn't take into account the number of interactions taking place, just the parties involved, dropping duplicate interactions is preferred. In other cases grouping is preferable, but since each interaction in your example appears to be truly independent, grouping seems ill-advised: the only criterion you could naturally group on would be rating, and it seems like a bad decision to aggregate that separately from any analytics you're performing.
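If it helps, here is a rough SQL sketch of the windowed-match idea from the question, assuming the files have been loaded into tables (pandas' merge_asof is the rough equivalent on the dataframe side). The table names file1/file2 and the 15-minute window are assumptions for illustration, and interval syntax varies by database engine:

-- Hypothetical windowed join: match a File 2 log row to a File 1 interaction
-- when its Datetime falls within 15 minutes after that interaction's EndDatetime.
SELECT f1.StartDatetime,
       f1.CustomerID,
       f1.EmployeeID,
       f1.Duration,
       f2.Issue
FROM file1 AS f1
JOIN file2 AS f2
  ON  f2.CustomerID = f1.CustomerID
  AND f2.EmployeeID = f1.EmployeeID
  AND f2.Datetime BETWEEN f1.EndDatetime
                      AND f1.EndDatetime + INTERVAL '15' MINUTE;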
I have to design a database, and I am identifying entities and their relationships. But every relationship seems to be many-to-many. For instance, in my case:
1) A staff member manages clients
Here a staff member can manage zero or more clients. Similarly, a client is managed by one or more staff members.
2) A client orders to buy a stock
Here a client can order zero, one, or more stocks to buy, and a stock can be ordered by zero, one, or more clients.
3) A client orders to sell a stock
Here a client can order zero, one, or more stocks to sell, and a stock can be ordered by zero, one, or more clients to sell.
These are some examples of my situation, and I am confused about how to resolve these relationships. There are numerous other cases like these in my scenario, and I am having difficulty conceptualizing the design.
So, please enlighten me regarding my situation.
It seems like there's quite a lot to the system you are developing and presumably there are requirements you haven't mentioned so it isn't really possible to come up with a complete answer. But hopefully some of the following will help you to "conceptualize the design" as you describe it.
1) This is a very common scenario and there's pretty much a standard way of dealing with these many-to-many relationships.
If there are 2 entities A and B with a many-to-many relationship then you would normally introduce an entity C that consists of 2 columns - one a foreign key to the unique id of A and the other a foreign key to the unique id of B. And you would remove the foreign key column in entity A pointing to B and vice versa.
i.e.
|-----|
| A |
|-----|
\|/
|
|
/|\
|-----|
| B |
|-----|
becomes:
|-----| |-----|
| A | | B |
|-----| |-----|
| |
| |
| |
/|\ /|\
|-------------|
| C |
|-------------|
The main challenge is often what to call these new entities! Sometimes they might just be something like a_b_relationship, but it's good if you can identify more meaningful names.
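For what it's worth, here is a minimal SQL sketch of that pattern; the names a, b and a_b are placeholders, not a recommendation:

CREATE TABLE a (
    a_id INTEGER PRIMARY KEY
    -- ... other columns of A
);

CREATE TABLE b (
    b_id INTEGER PRIMARY KEY
    -- ... other columns of B
);

-- The "C" entity: one row per related (A, B) pair.
CREATE TABLE a_b (
    a_id INTEGER NOT NULL REFERENCES a (a_id),
    b_id INTEGER NOT NULL REFERENCES b (b_id),
    PRIMARY KEY (a_id, b_id)
);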
2) It looks like you need to do a bit more analysis to identify all the actual entities. One way of doing this is to go through your description of the system and identify the nouns - often if there's a noun in the description it's appropriate to have an entity in the entity-relationship diagram.
"Order" jumps out as a noun you overlooked.
Typically for order-processing you would have 2 entities - the order, which contains the date, total value, customer, etc., and a child orderline, which identifies how many of which product have been ordered and the individual prices. So in ecommerce a shopping cart would be the order and each item in the shopping cart would be an orderline record.
In your scenario we'd have:
|----------| |-----------|
| client | | product |
|----------| |-----------|
| |
| |
| |
/|\ /|\
|-------------| /|-------------|
| order |--------| orderline |
|-------------| \|-------------|
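In SQL terms, a first cut of this might look something like the sketch below; the column names are only illustrative, and the order table is called client_order here simply to avoid the reserved word ORDER:

CREATE TABLE client (
    client_id INTEGER PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);

-- The order header holds order-level data for one client.
CREATE TABLE client_order (
    order_id   INTEGER PRIMARY KEY,
    client_id  INTEGER NOT NULL REFERENCES client (client_id),
    order_date DATE    NOT NULL
);

-- One orderline per product on an order: quantities and prices live here.
CREATE TABLE orderline (
    order_id   INTEGER NOT NULL REFERENCES client_order (order_id),
    product_id INTEGER NOT NULL REFERENCES product (product_id),
    quantity   INTEGER NOT NULL,
    unit_price DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (order_id, product_id)
);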
3) Client sells many products
Here you are identifying an additional role for a client and what I'd do here is question whether "client" is an appropriate entity at this stage. You may find it easier to think in terms of "buyer" and "seller" until the first-cut design is understood. If buyer and seller have a lot in common (especially if an individual can be both a buyer and a seller) then you may decide to use a single entity eventually. Your ERD tool may provide support for this - have a search for "subtype entities" or "entity subtypes".
The specifics will depend on your actual application but it could be that each orderline should have a relationship to the seller, and the order a relationship to the buyer. This will depend on whether it is possible for example for a buyer to order a number of items of a particular product, some of which are sourced from one seller and some from another. It could get complicated!
Also, it might be helpful to consider whether you need to record a seller's stock prior to it being sold. Here it might be useful to distinguish between "product" and "stock", e.g.
|---------| |-----------|
| seller | | product |
|---------| |-----------|
| |
| |
| |
/|\ /|\
|-----------------|
| stock |
|-----------------|
As a general comment I'd say it really can help to go through the design process step by step. So once you have got your initial model, assign all the data items you need to store to the appropriate entity, and methodically make sure that the design is in first normal form, then second normal form then 3rd normal form. Only once you have done this, and are confident that the design reflects the requirements, should you think about how to implement the design in a database. That's what I learned many years ago anyway!
It's hard to answer this question; everything in design is situational. If you really need to store which staff members manage a client, and a client can be managed by many staff members, then yes, your relationship is many-to-many. Bear in mind that there are many relationships between entities in the real world; you should only store the ones that are important and necessary.
As another example, if your stock holds the available count of that kind of goods, then the relationship between client and stock is many-to-many too.
Note: Don't use the plural form of a noun for your table names; it can lead to confusion about the relationships.
Edit: To implement a many-to-many relationship in your database tables, you will need a mediator table. For example, with a Customer table and a Product table, you would create a table named CustomerProduct (or whatever you want). The CustomerProduct table contains two foreign keys, one to the Customer table and another to the Product table. Usually (though not always) one many-to-many relationship breaks down into two many-to-one relationships.
See this link.
We are building a data warehouse and having an issue with construction of a fact table. Our company enrolls members on various levels of membership, promotions, and rates. We would like to create a fact table that could display all current membership information with the ability to roll out old information or changes to the membership. The tables we are using are as follows…
(Member Status | Start Date |End Date)
(Member Privilege | Price | Cycle | Start Date | End Date)
(Member Promotion | Associated Discount | Start Date | End Date)
(Member Type | Start Date | End Date)
(Personal Information | Address | Phone | Etc)
We would like the fact table to display all of this information based upon a Personal Id. (Personal information needs to remain a dimension table without many-to-many relationships because this information is used in other fact tables.) The issue is that we could have multiple privileges or promotions attached to one member. Is it better to have these be independent fact tables, or what is the best way to get this information into one table?
I'm designing a database that will hold a list of transactions. There are two types of transactions, I'll name them credit (add to balance) and debit (take from balance).
Credit transactions most probably will have an expiry, after which this credit balance is no longer valid, and is lost.
Debit transactions must store which credit transaction they come from.
There is always room for leniency with expiry dates. It does not need to be exact (within the rest of the day, for example).
My friend and I have come up with two different solutions, but we can't decide which to use. Maybe some of you folks can help us out:
Solution 1:
3 tables: Debit, Credit, DebitFromCredit
Debit: id | time | amount | type | account_id (fk)
Credit: id | time | amount | expiry | amount_debited | accountId (fk)
DebitFromCredit: amount | debit_id (fk) | credit_id (fk)
In table Credit, amount_debited can be updated whenever a debit transaction occurs.
When a debit transaction occurs, DebitFromCredit records which credit transaction(s) that debit was withdrawn from.
There is a function getBalance() that computes the balance from the expiry date, amount, and amount_debited. So there is no physical storage of the balance; it is calculated every time (a sketch of such a query is shown below).
There is also a chance to add a cron job that will check expired transactions and possibly add a Debit transaction with "expired" as a type.
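For reference, a getBalance() along the lines of Solution 1 could boil down to a single query over the Credit table. This is only a sketch against the columns listed above, with :account_id as a placeholder parameter, and it ignores locking and rounding concerns:

-- Sketch of getBalance(): remaining, unexpired credit for one account.
SELECT COALESCE(SUM(c.amount - c.amount_debited), 0) AS balance
FROM Credit AS c
WHERE c.accountId = :account_id
  AND c.expiry > CURRENT_TIMESTAMP;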
Solution 2:
3 tables: Transactions, CreditInfo, DebitInfo
Transactions: id | time | amount (+ or -) | account_id (fk)
CreditInfo: trans_id (fk) | expiry | amount | remaining | isConsumed
DebitInfo: trans_id (fk) | from_credit_id (fk) | amount
Table Account adds a "balance" column, which will store the balance. (another possibility is to sum up the rows in transactions for this account).
Any transaction (credit or debit) is stored in table transactions, the sign of the amount differentiates between them.
On credit, a row is added to CreditInfo.
On debit, one or more rows are added to DebitInfo (to handle debiting from multiple credits, if needed), and the "remaining" column of the corresponding CreditInfo row(s) is updated.
A cron job works on CreditInfo table, and whenever an expired row is found, it adds a debit record with the expired amount.
Debate
Solution 1 offers a clear distinction between the two transaction types, and getting data for each is pretty simple. Also, as there is no real need for a cron job (except to record expired credit as a debit), getBalance() always returns the correct current balance. It requires some kind of join to get data for reporting. No redundant data.
Solution 2 holds both transaction types in one table, with + and - amounts, and no updates occur on that table, only inserts. CreditInfo is updated on expiry (by the cron job) or on debiting. A single-table query gets the data for reporting. Some redundancy.
Choice?
Which solution do you think is better? Should the balance be stored physically or should it be calculated (considering that it might be updated with cron jobs)? Which one would be faster?
Also, if you guys have a better suggestion, we'd love to hear it as well.
Which solution do you think is better?
Solution 2. A transaction table with just inserts is easier for financial auditing.
Should the balance be stored physically or should it be calculated (considering that it might be updated with cron jobs)?
The balance should be stored physically. It's much faster than calculating the balance by reading all of the transaction rows every time you need it.
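As a rough sketch of how Solution 2 might keep the stored balance consistent, each operation can append to Transactions and maintain the Account balance inside one database transaction. The table and column names follow Solution 2; :id, :amount and :account_id are placeholders, and the BEGIN/COMMIT syntax varies by engine:

BEGIN;

-- Append-only ledger row: the sign of :amount distinguishes credit (+) from debit (-).
INSERT INTO Transactions (id, time, amount, account_id)
VALUES (:id, CURRENT_TIMESTAMP, :amount, :account_id);

-- Maintain the denormalized balance on the account.
UPDATE Account
SET balance = balance + :amount
WHERE id = :account_id;

COMMIT;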
I am an IT student who has passed a course called Databases, so pardon my inexperience.
I made this using MySQL Workbench; I can send you the model via email so you don't lose time recreating it from the picture.
This schema was made in 10 minutes. It holds transactions for a common shop.
Schema explanation
I have a person who can have multiple phones and addresses.
A person makes transactions. When making a transaction, you input:
the card name, e.g. American Express,
the card type, credit or debit (MySQL Workbench does not have domains or constraints like PowerDesigner does, as far as I know, so I left the field type as varchar; it should be limited to the strings "debit" or "credit"),
the card expiry date, e.g. 8/12,
the card number, e.g. 1111111111,
the amount by which to decrease the balance, e.g. 20.0,
the timestamp of the transaction (the program enters it when the data is entered),
and you link it to the person who made the transaction via the person_idperson field,
e.g. person_idperson = 1, where id 1 in the person table has the name John Smith.
What all this offers:
A transaction uses exactly one card, which is either credit or debit; it can't be both and can't be neither.
Speed: fewer table joins means more speed for the system.
What the program requires:
Continuously comparing fields: the exactTimestamp value must be less than the cardExpiery value when entering a transaction.
Constant entering of card details.
What the system does not have:
Saving the amount that has been used in a dedicated field; however, that can be accomplished with a SQL query.
Saving the amount that remains; I consider that personal information. What I mean is: you come to the shop and the shopkeeper asks, "How much money do you still have?"
The system does not tie a person to a card explicitly.
The person must be present and use the card with its details, which keeps the person anonymous. (The query is complex enough that an amateur, e.g. a shopkeeper, could not type it by hand.) If I want to know which card a person used last time, I get their last transaction and extract the card fields.
I hope you take this just as a proposition.
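Since the model picture may not come through, here is a hedged SQL reconstruction of what the description seems to imply. Only person_idperson comes from the answer above; the remaining table and column names (and the name transaction_record, chosen to avoid the reserved word TRANSACTION) are guesses for illustration:

CREATE TABLE person (
    idperson INTEGER PRIMARY KEY,
    name     VARCHAR(100) NOT NULL
    -- phones and addresses would live in their own child tables
);

CREATE TABLE transaction_record (
    idtransaction   INTEGER PRIMARY KEY,
    person_idperson INTEGER NOT NULL REFERENCES person (idperson),
    card_name       VARCHAR(50)   NOT NULL,  -- e.g. 'American Express'
    card_type       VARCHAR(6)    NOT NULL,  -- should be limited to 'credit' or 'debit'
    card_expiry     VARCHAR(5)    NOT NULL,  -- e.g. '8/12'
    card_number     VARCHAR(20)   NOT NULL,  -- e.g. '1111111111'
    amount          DECIMAL(12,2) NOT NULL,  -- amount by which to decrease, e.g. 20.0
    exact_timestamp TIMESTAMP     NOT NULL   -- entered by the program
);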
I'm designing a database application where data is going to change over time. I want to persist historical data and allow my users to analyze it using SQL Server Analysis Services, but I'm struggling to come up with a database schema that allows this. I've come up with a handful of schemas that could track the changes (including relying on CDC) but then I can't figure out how to turn that schema into a working BISM within SSAS. I've also been able to create a schema that translates nicely in to a BISM but then it doesn't have the historical capabilities I'm looking for. Are there any established best practices for doing this sort of thing?
Here's an example of what I'm trying to do:
I have a fact table called Sales which contains monthly sales figures. I also have a regular dimension table called Customers which allows users to look at sales figures broken down by customer. There is a many-to-many relationship between customers and sales representatives, so I can make a reference dimension called Responsibility that refers to the customer dimension and a Sales Representative reference dimension that refers to the Responsibility dimension. I now have the Sales facts linked to Sales Representatives by the chain of reference dimensions Sales -> Customer -> Responsibility -> Sales Representative, which allows me to see sales figures broken down by sales rep. The problem is that the Sales facts aren't the only things that change over time. I also want to be able to maintain a history of which Sales Representative was responsible for a Customer at the time of a particular Sales fact. I also want to know where the Sales Representative's office was located at the time of a particular Sales fact, which may be different than his current location. I might also want to know the size of a customer's organization at the time of a particular Sales fact, which also might be different than it is currently. I have no idea how to model this in a BISM-friendly way.
You mentioned that you currently have a fact table which contains monthly sales figures. So one record per customer per month. So each record in this fact table is actually an aggregation of individual sales "transactions" that occurred during the month for the corresponding dimensions.
So in a given month, there could be 5 individual sales transactions for $10 each for customer 123...and each individual sales transaction could be handled by a different Sales Rep (A, B, C, D, E). In the fact table you describe there would be a single record for $50 for customer 123...but how do we model the SalesReps (A-B-C-D-E)?
Based on your goals...
to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact
to know where the Sale Representative's office was located at the time of a particular sales fact
to know the size of a customer's organization at the time of a particular Sales fact
...I think it would be easier to model at a lower granularity...specifically a sales-transaction fact table which has a grain of 1 record per sales transaction. Each sales transaction would have a single customer and a single sales rep.
FactSales
DateKey (date of the sale)
CustomerKey (customer involved in the sale)
SalesRepKey (sales rep involved in the sale)
SalesAmount (amount of the sale)
Now for the historical change tracking...any dimension with attributes for which you want to track historical changes will need to be modeled as a "Slowly Changing Dimension" and will therefore require the use of "Surrogate Keys". So for example, in your customer dimension, Customer ID will not be the primary key...instead it will simply be the business key...and you will use an arbitrary integer as the primary key...this arbitrary key is referred to as a surrogate key.
Here's how I'd model the data for your dimensions...
DimCustomer
CustomerKey (surrogate key, probably generated via IDENTITY function)
CustomerID (business key, what you will find in your source systems)
CustomerName
Location (attribute we wish to track historically)
-- the following columns are necessary to keep track of history
BeginDate
EndDate
CurrentRecord
DimSalesRep
SalesRepKey (surrogate key)
SalesRepID (business key)
SalesRepName
OfficeLocation (attribute we wish to track historically)
-- the following columns are necessary to keep track of historical changes
BeginDate
EndDate
CurrentRecord
FactSales
DateKey (this is your link to a date dimension)
CustomerKey (this is your link to DimCustomer)
SalesRepKey (this is your link to DimSalesRep)
SalesAmount
What this does is allow you to have multiple records for the same customer.
Ex. CustomerID 123 moves from NC to GA on 3/5/2012...
CustomerKey | CustomerID | CustomerName | Location | BeginDate | EndDate | CurrentRecord
1 | 123 | Ted Stevens | North Carolina | 01-01-1900 | 03-05-2012 | 0
2 | 123 | Ted Stevens | Georgia | 03-05-2012 | 01-01-2999 | 1
The same applies with SalesReps or any other dimension in which you want to track the historical changes for some of the attributes.
So when you slice the sales transaction fact table by CustomerID, CustomerName (or any other non-historically-tracked attribute), you should see a single record with the facts aggregated across all transactions for the customer. And if you instead decide to analyze the sales transactions by CustomerName and Location (the historically tracked attribute), you will see a separate record for each "version" of the customer's location, with the sales amounts corresponding to the time the customer was in that location.
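To make the "at the time of the sale" behavior concrete, here is one way (not the only one) it tends to play out in SQL, using the table and column names from this answer; :customer_id and :sale_date are placeholders:

-- At load time: resolve the surrogate key that was in effect on the sale date.
SELECT d.CustomerKey
FROM DimCustomer AS d
WHERE d.CustomerID = :customer_id
  AND :sale_date >= d.BeginDate
  AND :sale_date <  d.EndDate;

-- At query time: sales broken down by the location the customer had when each sale occurred.
SELECT d.CustomerName,
       d.Location,
       SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimCustomer AS d ON d.CustomerKey = f.CustomerKey
GROUP BY d.CustomerName, d.Location;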
By the way, if you have some time and are interested in learning more, I highly recommend the Kimball bible "The Data Warehouse Toolkit"...which should provide a solid foundation on dimensional modeling scenarios.
The established best practices way of doing what you want is a dimensional model with slowly changing dimensions. Sales reps are frequently used to describe the usefulness of SCDs. For example, sales managers with bonuses tied to the performance of their teams don't want their totals to go down if a rep transfers to a new territory. SCDs are perfect for tracking this sort of thing (and the situations you describe) and allow you to see what things looked like at any point historically.
Spend some time on Ralph Kimball's website to get started. The first 3 articles I'd recommend you read are Slowly Changing Dimensions, Slowly Changing Dimensions Part 2, and The 10 Essential Rules of Dimensional Modeling.
Here are a few things to focus on in order to be successful:
You are not designing a 3NF transactional database. Get comfortable with denormalization.
Make sure you understand what grain means and explicitly define the grain of your database.
Do not use natural keys as keys, and do not bake any intelligence into your surrogate keys (with the exception of your time keys).
The goals of your application should be query speed and ease of understanding and navigation.
Understand type 1 and type 2 slowly changing dimensions and know where to use them.
Make sure you have a sponsor on the business side with the power to "break ties". You will find different people in the organization with different definitions of the same thing, and you need an enforcer with the power to make decisions. To see what I mean, ask 5 different people in your organization to define "customer" or "gross profit". You'll be lucky to get 2 people to define either the same way.
Don't try to wing it. Read The Data Warehouse Lifecycle Toolkit and embrace the ideas, even if they seem strange at first. They work.
OLAP is powerful and can be life changing if implemented skillfully. It can be an absolute nightmare if it isn't.
Have fun!