I am a CS student working on a small personal project: a platform where a user can book an appointment with a freelance photographer. I am going to use Postgres for this project, and I still have a few issues with the design of my database.
Photographer:
photographer_id (PK)
name (string)
registration_date (string)
company_id (int) # photographers are not full-time employees; they have their own company
id_path (string) # id picture path (passport for example)
rating (int)
User:
user_id (PK)
name (string)
registration_date (string)
Shooting:
shooting_id (PK)
photographer_id (FK)
user_id (FK)
date (date)
pictures_path (string) #path where the pictures of the shooting are stored
location_id (FK)
Location:
location_id (PK)
country (String)
city (String)
A user requests a shooting for a given day, time and location. A shooting always lasts 30 minutes. Once a user makes an appointment, the platform has to find a photographer available at the given day, time and location. Example: user_z requests a shooting in Berlin on November 23rd at 3:00 pm. The platform will look for all photographers available at that time and location. Let's say that on November 23rd at 3:00 pm in Berlin we have 3 photographers available:
photographer_id | name
------------------------
3746 | Thomas
5436 | Sofia
0835 | Maria
We are not going to reach out to all 3 of them and give the job to whoever replies fastest. Instead, each photographer will have a score based on 2 criteria: 1) rating and 2) cancellation rate.
The cancellation rate is the number of appointments a photographer cancels after accepting the job.
So photographers are ranked by that score. Coming back to the example for Berlin on November 23rd at 3:00 pm, the queue looks like this:
photographer_id | name | score
--------------------------------
3746 | Thomas | 0.32
5436 | Sofia | 0.48
0835 | Maria | 0.95
The platform will send a request to Maria first. If she does not reply within 30 minutes or she declines, we will send a request to Sofia, and so on. I created a new table for that, but I am still not sure about the structure and I still have a lot of things to figure out.
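The queue described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Candidate` class, the `offer_queue` and `first_to_accept` helpers and the `responds` callback are hypothetical names, and how the score itself is computed is left open.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    photographer_id: str
    name: str
    score: float  # higher is better; the formula is not defined in the question


def offer_queue(candidates):
    """Return candidates ordered so the highest score is contacted first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def first_to_accept(candidates, responds):
    """Walk the queue in score order.

    `responds(candidate)` is a hypothetical callback returning True (accepted),
    False (declined) or None (no reply within the timeout).
    """
    for candidate in offer_queue(candidates):
        if responds(candidate) is True:
            return candidate
    return None  # nobody accepted: the request would end up cancelled


# Example data taken from the Berlin example above
available = [
    Candidate("3746", "Thomas", 0.32),
    Candidate("5436", "Sofia", 0.48),
    Candidate("0835", "Maria", 0.95),
]
```

In a real platform the 30-minute wait would not be a blocking loop; it would more likely be driven by a scheduled job that checks for offers that have timed out.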
Requests:
request_id (PK)
user_id (FK) #from the user table
date_of_request (date) #when the request was made by the user on the platform
location_id (FK) # from the location table, where the client wants to have a shooting
date (date) # when the client wants to have a shooting
cancelled (boolean) # true if the request was cancelled by the client or no photographer was found for the given location and time
I don't know if what I am doing makes sense at all, but I have some difficulties with the requests table.
Should I add another column, photographer_id (FK) from the photographer table or shooting_id (FK) from the shooting table? If a photographer accepts a request, I need to be able to link it. My issue is that the requests and shooting tables have a lot of duplicate columns.
My second issue is with the score used to rank photographers. I need two elements to calculate that score:
rating and cancellation rate. I already have the rating in the photographer table, but I don't yet have the cancellation rate. I am still figuring out how to calculate that score on the platform (Flask), but I guess I will need a dedicated table for it?
If you have any other suggestions, it would be super helpful!
Actually, what you have is pretty good, but I have indicated some suggestions for you to consider.
Photographer:
photographer_id (PK)
name (string)
registration_date (string)
company_id (int) # photographers are not full-time employees; they have their own company
id_path (string) # id picture path (passport for example)
rating (int)
Changes
photographer_id int generated identity
registration_date should be type date not string
add: contact information, unless that is in company
Client:
client_id (PK)
name (string)
registration_date (string)
Changes
rename to Client (indicates the purpose of the entity; further, within IT "user" often has a bad connotation)
client_id int generated identity
registration_date should be type date not string
add contact information (see Requests)
Location:
location_id (PK)
country (String)
city (String)
change
location_id int generated identity
add: address (what happens if there are multiple locations in the same city?
Say, from your example, Berlin: a city of nearly 4M inhabitants covering roughly 890 sq km)
Requests:
request_id (PK)
client_id (FK) #from the client table
date_of_request (date) #when the request was made by the client on the platform
location_id (FK) # from the location table, where the client wants to have a shooting
date (date) # when the client wants to have a shooting
cancelled (boolean) # true is the request was cancelled by the client or no photographer was found for the given location and time
change
request_id int generated identity
date should be timestamp, as that includes the time as well as the date
add: status Pending - Request made, but no photographer yet
Accepted - Photographer has accepted the job
Canceled - The request is canceled; can indicate why/who canceled
add: contact information (see Client) - you need to be able to notify the client when the request is accepted or canceled. That information could come from the client table
Shooting:
shooting_id (PK)
photographer_id (FK)
client_id (FK)
location_id (FK)
date (date)
pictures_path (string) #path where the pictures of the shooting are stored
change
shooting_id int generated identity
This is the decision you need to make: keep this as an individual table, or combine it into Requests. Either way, a shooting is an accepted Request. As you indicated, the columns overlap considerably, but that is not necessary. You can combine them by moving photographer_id to Requests and eliminating Shooting, or keep Shooting but eliminate client_id, location_id, and date, then add request_id. There is no loss of information either way.
Your specific questions:
Add Shooting_id to Photographer? Absolutely not. What happens if
the photographer has multiple shootings? You can get the needed
information through the FK from Shooting to Photographer.
I cannot address rank as you are still figuring it out. However, I
would lean toward not storing it at all, as it's a calculated value. If
you store it, it becomes a static value and a maintenance issue: how
often would it be updated?
MISSING Elements: Besides the indicated Add columns
You have nothing to indicate a job offer sent to a photographer nor
any method of capturing a negative response from them.
You indicate scoring is dependent upon photographer's cancellation
rate. But you have nothing to capture that. Perhaps the suggested
Status in Requests but that's for you to decide.
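One possible way to cover both missing pieces is a table of offers sent to photographers, from which the cancellation rate is calculated on demand instead of stored. The sketch below is an assumption-heavy illustration: the `offer` table, its status values and the `cancellation_rate` helper are hypothetical names, the rate is read as "cancelled after accepting, divided by accepted", and SQLite stands in for Postgres so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offer (
    offer_id        INTEGER PRIMARY KEY,
    request_id      INTEGER NOT NULL,
    photographer_id INTEGER NOT NULL,
    -- hypothetical status values, one row per offer sent to a photographer
    status TEXT NOT NULL CHECK (status IN
        ('sent', 'declined', 'timed_out', 'accepted', 'cancelled_after_accept'))
);
INSERT INTO offer (request_id, photographer_id, status) VALUES
    (1, 835,  'accepted'),
    (2, 835,  'cancelled_after_accept'),
    (3, 835,  'declined'),
    (4, 5436, 'accepted');
""")


def cancellation_rate(photographer_id):
    """Share of accepted jobs that the photographer later cancelled."""
    accepted, cancelled = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(status = 'cancelled_after_accept'), 0)
        FROM offer
        WHERE photographer_id = ?
          AND status IN ('accepted', 'cancelled_after_accept')
        """,
        (photographer_id,),
    ).fetchone()
    # A photographer with no accepted jobs yet gets a rate of 0.0
    return cancelled / accepted if accepted else 0.0
```

Declined and timed-out offers are recorded but deliberately left out of the rate, since the question defines it in terms of jobs cancelled after acceptance.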
Trying to define the right schema / table for our scenario:
We have a few hundred eCommerce sites, each of which has a unique siteId.
Each site has its own end-users, up to 10M unique users per month. Each user has a unique userId.
Each end-user interacts with the site: view products, add products to cart and purchase products (we call these user events). I want to store the activities of the last 30 days (or 180 days if possible).
Things to consider:
Site sizes are different! We have some "heavy" sites with 10M end users but we also have "light" sites with a few hundreds/thousands of users.
Events don't have unique ids.
Users can have more than one event at a time, for example they can view a page with more than one product (but we could live without that restriction to simplify).
Rough estimation: 100 Customers x 10M EndUsers x 100 Interactions = 100,000,000,000 rows (per month)
Writes are done in real time (when the event arrives at the server). Reads are much less frequent (1% of the events).
Events have some more metadata and different events (view/purchase/..) have different metadata.
Using Keyspace to separate between sites, and manage table per each site vs. all customers in one table.
How to define the key here?
+--------+---------+------------+-----------+-----------+-----------+
| siteId | userId | timestamp | eventType | productId | other ... |
+--------+---------+------------+-----------+-----------+-----------+
| 1 | Value 2 | 1501234567 | view | abc | |
| 1 | cols | 1501234568 | purchase | abc | |
+--------+---------+------------+-----------+-----------+-----------+
My query is: Get all events (and their metadata) of specific user. As I assumed above, around 100 events.
Edit2:I guess it wasn't clear, but the uniqueness of users is per site, two different users might have the same id if they are on different sites
If you want to query by userid, then userid should be the first part of your compound primary key (this is the partition key). Use a compound primary key to create columns that you can query to return sorted results. I would suggest the following schema:
CREATE TABLE user_events (
-- userid is the partition key; the remaining key columns are clustering columns
userid bigint,
timestamp timestamp,
event_type text,
site_id bigint,
product_id bigint,
PRIMARY KEY (userid, site_id, timestamp, product_id));
That should make queries like
SELECT * FROM user_events WHERE userid = 123 AND site_id = 456;
quite performant. By adding the timestamp to the PK you can also easily LIMIT your queries to get only the latest 1000 (or however many you need) events without running into performance issues caused by highly active users (or bots) with a very long history. Note that clustering columns sort ascending by default, so the timestamp has to be ordered descending for LIMIT to return the latest events first.
One thing to keep in mind: I would recommend having user_id, or a composite of user_id and site_id, as the partition key (the first part of the primary key). That will prevent your rows from becoming too big.
So an alternative design would look like this:
CREATE TABLE user_events (
-- (userid, site_id) together form the partition key
userid bigint,
timestamp timestamp,
event_type text,
site_id bigint,
product_id bigint,
PRIMARY KEY ( (userid, site_id), timestamp, product_id));
The "downside" of this approach is that you always have to provide user and site-id. But I guess that is something that you have to do anyways, right?
To point out one thing: the partition key (also called the row key) identifies a row. A row stays on a specific node. For this reason it is a good idea to have rows of more or less the same size. A row with a couple of thousand or tens of thousands of columns is not really a problem. You will get problems if you have some rows with millions of columns and other rows with only 10-20 columns. That will cause the cluster to be unbalanced. Furthermore, it makes the row caches less effective. In your example I would suggest avoiding site_id alone as the partition key (row key).
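The imbalance described above can be made concrete with a small simulation. This is a plain-Python sketch with made-up numbers, not Cassandra code: `partition_sizes` is a hypothetical helper that counts how many events would land in each partition for a given choice of partition key.

```python
from collections import Counter

# Made-up (site_id, user_id) pairs: one heavy site and one light site,
# with 3 events per user.
events = [(1, user) for user in range(1000) for _ in range(3)]
events += [(2, user) for user in range(5) for _ in range(3)]


def partition_sizes(events, key):
    """Count how many events fall into each partition for a key function."""
    return Counter(key(site_id, user_id) for site_id, user_id in events)


# site_id alone: one huge partition and one tiny one
by_site = partition_sizes(events, lambda site, user: site)

# (site_id, user_id): many small partitions of equal size
by_site_and_user = partition_sizes(events, lambda site, user: (site, user))
```

With these numbers `by_site` holds 3000 events for site 1 and 15 for site 2, while every `(site_id, user_id)` partition holds exactly 3 events.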
Does that make sense to you? Maybe the excellent answer to this post gives you some more insight: difference between partition-key, composite-key and clustering-key. Furthermore, a closer look at this part of the DataStax documentation offers some more details.
Hope that helps.
My query is: Get all events (and their metadata) of specific user. As I assumed above, around 100 events.
So, you want all the events of a given user. As each user has a unique id within a site, you can build the table using site_id and userid as the partition key and timestamp as the clustering key. Here is the table structure:
CREATE TABLE user_events_by_time (
userid bigint,
timestamp timestamp,
event_type text,
product_id bigint,
site_id bigint,
PRIMARY KEY ((site_id,userid), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC) ;
Now you can query all of a user's events in a given time range by using the following query:
SELECT * from user_events_by_time WHERE site_id= <site_id> and userid = <user_id> and timestamp > <from_time> and timestamp < <to_time>;
Hope this solves your problem.
I'm designing a database that will hold a list of transactions. There are two types of transactions, I'll name them credit (add to balance) and debit (take from balance).
Credit transactions most probably will have an expiry, after which this credit balance is no longer valid, and is lost.
Debit transactions must store from which credit transaction they come from.
There is always room for leniency with expiry dates. It does not need to be exact (till the rest of the day for example).
My friend and I have come up with two different solutions, but we can't decide which to use. Maybe some of you folks can help us out:
Solution 1:
3 tables: Debit, Credit, DebitFromCredit
Debit: id | time | amount | type | account_id (fk)
Credit: id | time | amount | expiry | amount_debited | accountId (fk)
DebitFromCredit: amount | debit_id (fk) | credit_id (fk)
In table Credit, amount_debited can be updated whenever a debit transaction occurs.
When a debit transaction occurs, DebitFromCredit records which credit transaction(s) this debit was withdrawn from.
There is a function getBalance(), that will get the balance according to expiry date, amount and amount_debited. So there is no physical storage of the balance; it is calculated every time.
There is also a chance to add a cron job that will check expired transactions and possibly add a Debit transaction with "expired" as a type.
Solution 2
3 tables: Transactions, CreditInfo, DebitInfo
Transactions: id | time | amount (+ or -) | account_id (fk)
CreditInfo: trans_id (fk) | expiry | amount | remaining | isConsumed
DebitInfo: trans_id (fk) | from_credit_id (fk) | amount
Table Account adds a "balance" column, which will store the balance. (another possibility is to sum up the rows in transactions for this account).
Any transaction (credit or debit) is stored in table transactions, the sign of the amount differentiates between them.
On credit, a row is added to creditInfo.
On debit, one or more rows are added to DebitInfo (to handle debiting from multiple credits, if needed). Also, the "remaining" column of each affected CreditInfo row is updated.
A cron job works on CreditInfo table, and whenever an expired row is found, it adds a debit record with the expired amount.
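To illustrate how one debit could be split across several credits, here is a minimal Python sketch. It rests on assumptions the question does not state: credits are consumed earliest expiry first, a credit stays usable through its expiry day, and amounts are integers. `allocate_debit` and the dictionary layout are hypothetical.

```python
from datetime import date


def allocate_debit(credits, amount, today):
    """Split `amount` across unexpired credits, earliest expiry first.

    `credits` is a list of dicts with 'id', 'expiry' and 'remaining'.
    Returns [(credit_id, amount_taken), ...], one entry per DebitInfo row,
    and raises ValueError if the unexpired credit does not cover the debit.
    """
    usable = sorted(
        (c for c in credits if c["expiry"] >= today and c["remaining"] > 0),
        key=lambda c: c["expiry"],
    )
    if sum(c["remaining"] for c in usable) < amount:
        raise ValueError("insufficient unexpired credit")
    rows = []
    for credit in usable:
        if amount == 0:
            break
        taken = min(credit["remaining"], amount)
        credit["remaining"] -= taken  # mirrors updating CreditInfo.remaining
        amount -= taken
        rows.append((credit["id"], taken))
    return rows


credits = [
    {"id": 1, "expiry": date(2024, 3, 1), "remaining": 50},
    {"id": 2, "expiry": date(2024, 2, 1), "remaining": 30},
    {"id": 3, "expiry": date(2023, 12, 31), "remaining": 100},  # already expired
]
rows = allocate_debit(credits, 60, today=date(2024, 1, 15))
```

Here the debit of 60 takes 30 from credit 2 (which expires first) and 30 from credit 1, and the expired credit 3 is left untouched.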
Debate
Solution 1 offers distinction between the two tables, and getting data is pretty simple for each. Also, as there is not a real need for a cron job (except if to add expired data as a debit), getBalance() gets the correct current balance. Requires some kind of join to get data for reporting. No redundant data.
Solution 2 holds both transactions in one table, with + and - for amounts, and no updates are occurring to this table; only inserts. Credit Info is being updated on expiry (cron job) or debiting. Single table query to get data for reporting. Some redundancy.
Choice?
Which solution do you think is better? Should the balance be stored physically or should it be calculated (considering that it might be updated with cron jobs)? Which one would be faster?
Also, if you guys have a better suggestion, we'd love to hear it as well.
Which solution do you think is better?
Solution 2. A transaction table with just inserts is easier for financial auditing.
Should the balance be stored physically or should it be calculated (considering that it might be updated with cron jobs)?
The balance should be stored physically. It's much faster than calculating the balance by reading all of the transaction rows every time you need it.
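If the balance is stored, it has to change in the same database transaction as the insert, otherwise the two can drift apart. A minimal sketch, assuming SQLite as a stand-in database, integer amounts, and a hypothetical `record` helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (
    id      INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE transactions (
    id         INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES account(id),
    amount     INTEGER NOT NULL  -- positive = credit, negative = debit
);
INSERT INTO account (id) VALUES (1);
""")


def record(account_id, amount):
    """Insert the transaction row and adjust the stored balance atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
            (account_id, amount),
        )
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )


record(1, 100)
record(1, -30)
stored = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
summed = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE account_id = 1"
).fetchone()[0]
```

Because the transactions table is insert-only, the summed value can still be used as an audit check against the stored balance.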
I am an IT student who has passed a course called Databases, so pardon my inexperience.
I made this using MySQL Workbench; I can send you the model via email so you do not lose time recreating it from the picture.
This schema was made in 10 minutes. It holds transactions for a common shop.
Schema explanation
I have a person who can have multiple phones and addresses.
The person makes transactions. When he is making a transaction,
you input the card name, e.g. American Express,
the card type, credit or debit (MySQL Workbench does not have domains or constraints like PowerDesigner as far as I know, so I left the field type as varchar; input should be limited to the string "debit" or "credit"),
the card expiry date, e.g. 8/12,
the card number, e.g. 1111111111,
the amount to deduct, e.g. 20.0,
the timestamp of the transaction
(the program sets it when entering the data),
and link it to the person who made the transaction
via the person_idperson field,
e.g.
1
where id 1 in table person has the name John Smith.
What all this offers:
A transaction uses exactly one card, credit or debit; it can't be both and it can't be none.
Speed: fewer table joins means a faster system.
What program requires:
Continuously checking, when a transaction is entered, that the variable exactTimestamp is less than the variable cardExpiery.
Entering the card details every time.
What the system does not have:
Saving the amount that has been used in a dedicated field; however, that can be computed with an SQL query.
Saving the amount that remains; I find that to be personal information. What I mean is: would you want to come to the shop and have the shopkeeper ask how much money you still have?
The system does not tie a person to the card explicitly.
The person must be present and use the card with its details, which keeps the person anonymous. (It should be a query complex enough that an amateur, e.g. a shopkeeper, would not type it by hand.) If I want to know which card a person used last time, I get his last transaction and extract the card fields.
I hope you consider this just a proposal.
I could really use some insights on choosing between the following two database layouts.
Layout #1 | Layout #2
|
CUSTOMERS | CUSTOMERS
id int pk | id int pk
info char | info char
|
ORDERS | ORDERS
id int pk | id int pk
customerid int fk | customerid int fk
date timedate | date timedate
|
DETAILS | INVOICES
id int pk | id int pk
orderid int fk | orderid int fk
date timedate | date timedate
description char |
amount real | DETAILS
period int | id int pk
| invoiceid int fk
| date timedate
| description char
| amount real
This is a billing application for a small business, a sole proprietor. The first layout has no separate table for invoices, relying instead on the field 'period' in DETAILS for the billing cycle number. The second layout introduces a table specifically for invoices.
Specifically in this application, at what point do you see Layout #1 breaking, or what kind of things will get harder and harder as the amount of data increases? In the case of Layout #2, what does the added flexibility/complexity mean in practical terms? What are the implications for 30-60-90 aging? I'm sure that will be necessary at some point.
More generally, this seems to be a general case of whether you track/control something through a field in a table or a whole new table, yet it's not really a normalization issue, is it? How do you generally make the choice?
Given the previous comments, this is how I would approach it:
CUSTOMERS
id int pk
info char
CASES
id int pk
customerid int fk
dateOpened datetime
dateClosed datetime
status int <- open, closed, final billed, etc.
BillPeriod int <- here is where you determine how often to bill the client.
BillStartDate datetime <- date that billings should start on.
BILLING
billingid int pk
caseid int fk
userid int fk <- id of person who is charging to this case. i.e. the lawyer.
invoicedetailid fk <- nullable, this will make it easier to determine if this particular item has been invoiced or not.
amount money
billdate datetime
billingcode int fk <- associate with some type of billing code table so you know what this is: time, materials, etc.
description char
INVOICES
invoiceid int pk
customerid int FK
invoicedate datetime
amount money <- sum of all invoice details
status int <- paid, unpaid, collection, etc..
discount money <- sum of all invoice details discounts
invoicetotal <- usually amount - discount.
INVOICEDETAILS
invoicedetailid int PK
invoiceid int FK
billingid int FK
discount money <- amount of a discount, if any
===========
In the above you open a "case" and associate it with a customer. On an ongoing basis one or more people apply Billings to the case.
Once the combination of the bill start date and period has elapsed, the system will create a new Invoice with Details copied from the Billing table. It should do this based on those details that have not already been billed. You should lock the billing record from future changes once it has been invoiced.
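The invoicing step described above could look roughly like the following. It is a sketch under assumptions: the tables are cut down to the columns needed here, amounts are stored as integer cents, SQLite is used so the example runs on its own, and `invoice_case` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE billing (
    billingid       INTEGER PRIMARY KEY,
    caseid          INTEGER NOT NULL,
    amount          INTEGER NOT NULL,  -- cents
    invoicedetailid INTEGER            -- NULL until the item is invoiced
);
CREATE TABLE invoices (
    invoiceid INTEGER PRIMARY KEY,
    amount    INTEGER NOT NULL
);
CREATE TABLE invoicedetails (
    invoicedetailid INTEGER PRIMARY KEY,
    invoiceid       INTEGER NOT NULL,
    billingid       INTEGER NOT NULL
);
INSERT INTO billing (caseid, amount) VALUES (1, 5000), (1, 2500), (2, 900);
""")


def invoice_case(caseid):
    """Create one invoice covering every un-invoiced billing of a case."""
    with conn:  # all inserts and updates commit or roll back together
        rows = conn.execute(
            "SELECT billingid, amount FROM billing "
            "WHERE caseid = ? AND invoicedetailid IS NULL",
            (caseid,),
        ).fetchall()
        if not rows:
            return None  # nothing new to bill
        invoiceid = conn.execute(
            "INSERT INTO invoices (amount) VALUES (?)",
            (sum(amount for _, amount in rows),),
        ).lastrowid
        for billingid, _ in rows:
            detailid = conn.execute(
                "INSERT INTO invoicedetails (invoiceid, billingid) VALUES (?, ?)",
                (invoiceid, billingid),
            ).lastrowid
            # Marking the billing row keeps it from being invoiced twice.
            conn.execute(
                "UPDATE billing SET invoicedetailid = ? WHERE billingid = ?",
                (detailid, billingid),
            )
        return invoiceid
```

Calling `invoice_case(1)` a second time returns `None`, because the nullable invoicedetailid column now marks both billing rows as invoiced.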
You might have to change "BillPeriod" to some other type of field if you need different triggers. For example, period is just one "trigger" to create an invoice.
One might include sending an invoice when you hit a certain dollar amount. This could be configured at the customer or case level. Another option is capping expenditures. For example putting a cap value at the case level which would prevent billings from going over the cap; or at the very least causing an alert to be sent to the relevant parties.
I'm not entirely sure why "period" is attached to the item instead of the order itself. Layout #1 seems to imply that you can have an open "order" that is made up of "details" which might be added and paid for over a period of years. This seems very wrong and should make accounting a nightmare. Layout #2 really isn't much better.
Generally speaking an Order is comprised of a single transaction with a purchase or contract date. That transaction might encompass multiple detail items, but it's still one transaction. It represents a single agreement made at a certain point in time between the buyer and seller. If new items are purchased, a new order is created... With this in mind, neither table structure works.
Regarding Invoices. An order might have one or more invoices attached to it. The goal of invoices is to have payments applied against them. For small transactions there is a one-to-one relationship between invoices and orders.
In larger transactions, you might have multiple invoices that get applied to a single order. For example if you have contracted to make "3 easy payments of $199.99 ..." . In this case you would have 3 invoices for $199.99 each applied to a single order whose total was $599.97; and each due at different time periods.
An Invoice table should then have at minimum the Order Id, Invoice Number, Invoice Date, Invoiced Amount, Date Due, Transaction Id (for credit card), Check Number (obvious), Amount Received and Date Received fields.
If you want to get fancy and support more of the real world, then you would additionally have a Payments table which stored the Invoice Number, Amount Received (or refunded), Date Received, Transaction ID, and Check Number. If you go this route, remove those fields from the Invoice table.
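With a due date on every invoice, the 30-60-90 aging asked about in the question becomes a simple classification of unpaid invoices. A minimal sketch; the bucket boundaries and labels are an assumption, and `aging_bucket` is a hypothetical helper:

```python
from datetime import date


def aging_bucket(due_date, as_of):
    """Classify an unpaid invoice by how many days it is past due."""
    days_late = (as_of - due_date).days
    if days_late <= 0:
        return "current"
    if days_late <= 30:
        return "1-30"
    if days_late <= 60:
        return "31-60"
    if days_late <= 90:
        return "61-90"
    return "over 90"
```

An aging report would then group the outstanding amount of each unpaid invoice by this bucket, which is hard to do if the only date available sits on the detail rows.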
Now, if you need to support recurring charges (for example, internet hosting), then you would have different tables called "Contracts" and "ContractDetails" or something similar. These tables would store the contract details (similar to orders and order details, but including start date, end date, and recurring period). When the next billing period is hit, the details would be used to create an order and generate the appropriate invoices.
Since you're doing legal billing, I'd suggest you spend some time looking at the features of Sage Timeslips. Lawyers don't behave like other people; accounting software for lawyers doesn't behave like other accounting software. It's the nature of the business.
They have a 30-day free trial, and you can probably learn a lot from the help files and documentation.
Besides, reverse-engineering database design from the user interface is good practice.
I am developing an Electronic Bill Payment System for a bank which has more than 100 customers subscribed to Electronic Bill Payment.
I have a table in which I create a profile for each customer, like the following.
Customer Table
Customer_ID (Primary key)
Customer_Name
Address
Phone
Bill_Master
Customer_ID
Enrollment_Date
UtilityCompany_ID
Bill_Generation_Date (lets say 18th of every month)
Bill_Due_Date (lets say 27th of every month)
Our_CutOff_Date (when the bank will generate the bill for payment)
I have created the customer along with a hint of which utility company's bill is generated on which date and what the due date will be EVERY MONTH.
I want to create an interface where the user sees the entries of all customers whose bill is due for generation. The user will then click on a particular entry and generate the bill manually by entering the amount and other details. I am clueless about how to find out which customer's utility company bill is due for generation.
Any help on how I should design and query this? Or should it be an automatic procedure, like a job or something, that does it?
thanks
Well, if it's a query you need to find bills that are due on the 27th of every month and also find the customer's utility company, then you can try this:
SELECT A.Customer_Name, B.UtilityCompany_ID
FROM Customer A
INNER JOIN Bill_Master B ON A.Customer_ID = B.Customer_ID
-- BETWEEN needs the earlier date first, otherwise the range is empty
WHERE B.Bill_Due_Date BETWEEN DATEADD(MONTH, -1, GETDATE()) AND GETDATE()
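Since the question stores a recurring day of the month rather than a full date, the "due for generation" check can also be done in application code. The following is a sketch under assumptions: `generation_day` holds the day of the month, `last_generated` is a hypothetical column recording when the bill was last generated, and days such as the 31st are clamped to the last day of shorter months.

```python
import calendar
from datetime import date


def generation_date_in_month(day_of_month, year, month):
    """Date of a recurring day in a given month, clamped to the month's length."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day_of_month, last_day))


def due_for_generation(bills, today):
    """Return customer ids whose generation day has arrived this month
    and whose bill has not been generated for this month yet."""
    due = []
    for bill in bills:
        target = generation_date_in_month(
            bill["generation_day"], today.year, today.month
        )
        already_done = (
            bill["last_generated"] is not None
            and bill["last_generated"] >= target
        )
        if today >= target and not already_done:
            due.append(bill["customer_id"])
    return due
```

The same function could back either the manual screen or a scheduled job; the only state it needs beyond Bill_Master is the date of the last generated bill.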
I'm building a Volunteer Management System and I'm having some DB design issues:
To explain the process:
Volunteers can sign up for accounts. Volunteers report their hours to a project (each volunteer can have multiple projects). Volunteer supervisors are notified when a volunteer's number of hours is close to some specified amount, so they can give them a reward.
For example:
a volunteer who has volunteered 10 hours receives a free t shirt.
The problem I'm having is how to design the DB in such a way that a single reward profile can be related to multiple projects as well as have a single reward profile be "multi-tiered". A big thing about this is that rewards structures may change so they can't be just hardcoded.
Example of what I mean by "multi-tiered" reward profile:
A volunteer who has volunteered 10 hours receives a free t shirt.
A volunteer who has volunteered 40 hours receives a free $50 appreciation check.
The solutions I've come up with myself are:
To have a reward profile table that relates one row to each reward profile.
rewardprofile:
rID(primary key) - int
description - varchar / char(100)
details - varchar / file (XML)
Aside, just while on the topic, can DB field entries be files?
OR
To have a rewards table in which each row relates one preset number of hours to a reward, and a second rewards profile table that binds the rewards entries together:
rewards:
rID(primary key) - int
rpID (references rewardsProfile) - int
numberOfHrs - int
rewardDesc - varchar / char(100)
rewardsprofile:
rpID(primary key) - int
description
so this might look something like:
rewardsprofile:
rpid | desc
rp01 | no reward
rp02 | t-shirt only
rp03 | t-shirt and check
rewards
rid | rpID | hours | desc
r01 | rp02 | 10 | t-shirt
r02 | rp03 | 10 | t-shirt
r03 | rp03 | 40 | check
I'm sure this issue is nothing new but my google fu is weak and I don't know how to phrase this in a meaningful way. I think there must be a solution out there more formalized than my (hack and slash) method. If anyone can direct me to what this problem is called or any solutions to it, that would be swell. Thanks for all your time!
Cheers,
-Jeremiah Tantongco
Yes, database fields can be files (type binary, character large object, or xml) depending on the implementation of the specific database.
The rewardsprofile table looks like it might be challenging to maintain if you have a large number of different rewards in the future. One thing you might consider is a structure like:
rewards:
rID(primary key) - int
numberOfHrs - int
rewardDesc - varchar / char(100)
volunteers:
vID(primary key) - int
.. any other fields you want here ..
rewardshistory:
vID (foreign key references volunteers)
rID (foreign key references rewards)
Any time you want to add a reward, you add it to the rewards table. Old rewards stay in the table (you might want a 'current' field or something to track whether the reward can still be assigned). The rewardshistory table tracks which rewards have been given to which volunteers.
This is a rough structure of how I would handle this:
Volunteers
volunteerid
firstname
lastname
VolunteerAddress
volunteerid
Street1
Street2
City
State
Postalcode
Country
Addresstype (home, business, etc.)
VolunteerPhone
volunteerid
Phone number
Phonetype
VolunteerEmail
volunteerid
EmailAddress
Project
Projectid
projectname
VolunteerHours
volunteerid
hoursworked
projectid
DateWorked
Rewards
Rewardid
Rewardtype (Continual, datelimited, etc.)
Reward
RewardBeginDate
RewardEndDate
RequiredHours
Awarded
VolunteerID
RewardID
RewardDate
You will probably have some time-limited rewards; that's why I added the date fields. You would then set up a job to calculate rewards once a week or once a month or so. Make sure to exclude those who have already received that particular award if pertinent (you don't want to give a new t-shirt for every 10 hours worked, do you?).
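The periodic job could boil down to one query that finds rewards whose hour threshold has been reached and that have not been awarded yet. A simplified sketch, assuming SQLite as a stand-in database, hours summed across all projects, and the date and reward-type columns left out; `pending_rewards` is a hypothetical helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VolunteerHours (volunteerid INTEGER, projectid INTEGER, hoursworked INTEGER);
CREATE TABLE Rewards (rewardid INTEGER PRIMARY KEY, reward TEXT, requiredhours INTEGER);
CREATE TABLE Awarded (volunteerid INTEGER, rewardid INTEGER, rewarddate TEXT);
INSERT INTO VolunteerHours VALUES (1, 10, 6), (1, 11, 5), (2, 10, 45);
INSERT INTO Rewards VALUES (1, 't-shirt', 10), (2, 'check', 40);
INSERT INTO Awarded VALUES (2, 1, '2024-01-01');  -- volunteer 2 already has the t-shirt
""")


def pending_rewards():
    """Rewards whose hour threshold is met and that were not awarded before."""
    return conn.execute("""
        SELECT h.volunteerid, r.reward
        FROM (SELECT volunteerid, SUM(hoursworked) AS total
              FROM VolunteerHours
              GROUP BY volunteerid) AS h
        JOIN Rewards AS r ON h.total >= r.requiredhours
        LEFT JOIN Awarded AS a
               ON a.volunteerid = h.volunteerid AND a.rewardid = r.rewardid
        WHERE a.volunteerid IS NULL
        ORDER BY h.volunteerid, r.requiredhours
    """).fetchall()
```

With this data, volunteer 1 (11 hours) is due a t-shirt and volunteer 2 (45 hours) is due only the check, because the t-shirt is already in Awarded.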
Yes, DB field entries can be files. Or, more precisely, they can be filespecs that reference files. Is that what you really meant?
While we are on the subject of data fields that reference other data, how much do you know about foreign keys? What can you accomplish with references to files that you couldn't accomplish even better by the judicious use of foreign keys?
Foreign keys, and the keys that they refer to, are fundamental concepts in the relational model of data. Without this model, your database design is going to be pretty random.
Morning,
You really must place all your tables on a chart then determine the business rules for that chart in the entity relationship diagram. Once you decide what the direct relationships are between each and every table only then would you test to see if you get the desired answers. This procedure is called database design and it appears that you didn't do that as of yet but got ahead of yourself a little bit from what I see.
There are plenty of good books on database design on the market. The one I use is "Database Design For Mere Mortals". It is very easy to read and understand.
Hope this helps.