So I have a transaction table (postgres) that inserts a new row whenever a user renews their subscription for our service. The table subscription looks like this:
+--------+--------+------------+
| userId | prodId | renew_date |
+--------+--------+------------+
| 1 | 1 | 2018-05-01 |
| 1 | 1 | 2018-06-01 |
| 1 | 1 | 2018-07-01 |
| 2 | 3 | 2017-04-16 |
| 2 | 3 | 2017-05-16 |
+--------+--------+------------+
If the analysts want to figure out the Nth renewal or the latest renewal for a particular user or product, I can see two ways to provide that:
1.) During my ETL process, I truncate the data warehouse target table and re-populate it with:
select *
, row_number() over (partition by userId, prodId order by renew_date asc) as nth_renewal
from subscription
I can't think of a way to add 1 to the previous renewal number if I were to do incremental updates. And what if this is the customer's first renewal?
2.) I just copy the exact OLTP table over to the data warehouse and do incremental updates every day. This way, I let the analysts calculate the nth renewal themselves. (Also, as a follow-up question: is it ever OK to have a duplicate copy of a transactional table in my data warehouse?)
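One possible way to number renewals during incremental loads is to look up the highest number already stored for the same user and product and add 1, falling back to 0 when no earlier row exists (which covers the first renewal). The sketch below uses Python's built-in sqlite3 as a stand-in for the warehouse; the table name dw_subscription and the function load_increment are made up for illustration, and it assumes rows never arrive out of date order between batches.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dw_subscription ("
    " userId INTEGER, prodId INTEGER, renew_date TEXT, nth_renewal INTEGER)"
)

def load_increment(conn, new_rows):
    """Append new OLTP rows, numbering each one after the rows already loaded."""
    # Process in date order so several renewals in one batch number correctly.
    for user_id, prod_id, renew_date in sorted(new_rows, key=lambda r: r[2]):
        conn.execute(
            "INSERT INTO dw_subscription (userId, prodId, renew_date, nth_renewal) "
            # MAX() over zero rows is NULL, so a first renewal gets 0 + 1 = 1.
            "SELECT ?, ?, ?, COALESCE(MAX(nth_renewal), 0) + 1 "
            "FROM dw_subscription WHERE userId = ? AND prodId = ?",
            (user_id, prod_id, renew_date, user_id, prod_id),
        )
    conn.commit()

# Day 1 load, then a later incremental load.
load_increment(conn, [(1, 1, "2018-05-01"), (1, 1, "2018-06-01"), (2, 3, "2017-04-16")])
load_increment(conn, [(1, 1, "2018-07-01"), (2, 3, "2017-05-16")])

rows = conn.execute(
    "SELECT userId, prodId, renew_date, nth_renewal "
    "FROM dw_subscription ORDER BY userId, renew_date"
).fetchall()
```

If late-arriving rows are possible, this numbering would drift, and recomputing row_number() for the affected user/product pairs would be the safer choice.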
I have a table that records the sessions players have taken part in during group music play (musical instruments).
If a user joins a session and leaves, one row is created. If they join the same session twice, two rows are created.
Table: music_sessions_user_history
| Column | Type | Default |
| --- | --- | --- |
| id | character varying(64) | uuid_generate_v4() |
| user_id | user_id | |
| created_at | timestamp without time zone | now() |
| session_removed_at | timestamp without time zone | |
| max_concurrent_connections | integer | |
| music_session_id | character varying(64) | |
This table basically records the amount of time a user was in a given session, so you can think of each row as a time range (a tsrange in PG). max_concurrent_connections is a count of the number of users who were in the session at once.
At its heart, the query needs to find overlapping time ranges for different users in the same session and then count each overlap as a pair that played together.
The query needs to report each user who played in a music session with others, and who those other users were.
So for example, if userA played with userB, and that's the only data in the database, then two rows would be returned:
| User | Other users in the session |
| --- | --- |
|userA | [userB] |
|userB | [userA] |
But if userA played with both userB and userC, then three rows would be returned:
| User | Other users in the session |
| --- | --- |
|userA | [userB, userC]|
|userB | [userA, userC]|
|userC | [userA, userB]|
Any help constructing this query is much appreciated.
update:
I am able to get overlapping records using this query.
select m1.user_id, m1.created_at, m1.session_removed_at, m1.max_concurrent_connections, m1.music_session_id
from music_sessions_user_history m1
where exists (select 1
from music_sessions_user_history m2
where tsrange(m2.created_at, m2.session_removed_at, '[]') && tsrange(m1.created_at, m1.session_removed_at, '[]')
and m2.music_session_id = m1.music_session_id
and m2.id <> m1.id);
I need a way to convert these results into pairs.
Create a cursor and, for each fetched record, determine which other records intersect by checking whether their start and end times fall between its start and end times.
Append the intersecting results to a temporary table.
Select the results from the temporary table.
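As an alternative to the cursor, the pairing can probably be done in one set-based self-join: match rows from different users in the same session whose ranges overlap, then group the partners per user. The sketch below is only an illustration run against SQLite via Python's sqlite3, with made-up sample rows and timestamps stored as text; in Postgres the overlap test could stay as the tsrange && tsrange condition from the question, and array_agg(DISTINCT ...) with GROUP BY could build the list. Rows with a NULL session_removed_at are not handled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE music_sessions_user_history ("
    " id TEXT PRIMARY KEY, user_id TEXT, created_at TEXT,"
    " session_removed_at TEXT, music_session_id TEXT)"
)
# Made-up sample rows: userA and userB overlap in s1; userC joins after both left.
conn.executemany(
    "INSERT INTO music_sessions_user_history VALUES (?, ?, ?, ?, ?)",
    [
        ("1", "userA", "2020-01-01 10:00", "2020-01-01 11:00", "s1"),
        ("2", "userB", "2020-01-01 10:30", "2020-01-01 11:30", "s1"),
        ("3", "userC", "2020-01-01 12:00", "2020-01-01 13:00", "s1"),
    ],
)

PAIRS_SQL = """
SELECT DISTINCT m1.user_id, m2.user_id
FROM music_sessions_user_history m1
JOIN music_sessions_user_history m2
  ON m2.music_session_id = m1.music_session_id
 AND m2.user_id <> m1.user_id
 -- closed ranges overlap when each one starts before the other ends
 AND m1.created_at <= m2.session_removed_at
 AND m2.created_at <= m1.session_removed_at
"""

# Collect each user's partners, then sort them for a stable result.
played_with = {}
for user, other in conn.execute(PAIRS_SQL):
    played_with.setdefault(user, set()).add(other)
result = {user: sorted(others) for user, others in played_with.items()}
```

Because the join is symmetric, each pair shows up once from each side, which gives one output row per user as in the expected tables above.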
There are a few payment methods: credit/debit card, cash, bitcoin.
This is my payment transaction table:
Transaction:
| ID | AMOUNT | METHOD |
| 1 | 80 | credit |
| 2 | 100 | cash |
Transaction_credit:
| ID | AMOUNT | TYPE | TRANSACTION_ID |
| 1 | 80 | sale | 1 |
| 2 | -80 | reversal | 1 |
Transaction_cash:
| ID | AMOUNT | TYPE | TRANSACTION_ID |
| 2 | 100 | payment | 2 |
| 2 | -100 | refund | 2 |
Do you think it is a good idea to have amount in card, cash, and bitcoin sub table?
How can I solve the duplicate amount in sub table?
I think your database design needs some improvements.
Firstly: in accounting systems, the Transaction entity (table) holds all money transactions. If a sale is reversed, you should add a new Transaction row for the reversal. Likewise, if a payment is refunded, you should add a new Transaction row for the refund.
Secondly: the details of each transaction should be saved in second-level entities (tables), as you have correctly designed. Transaction types (e.g. card, cash, bitcoin) have many different attributes, so putting all types in one entity leads to design traps such as columns that are mostly NULL.
Thirdly: if you want a complete accounting system that supports all accounting functions (like generating a balance sheet), you will need to add many other entities.
In this case, though, you should hold Amount in Transaction. Finding Amount in the other tables becomes difficult when you want to run queries based on the overall Amount of a Transaction.
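A minimal sketch of that layout, using Python's sqlite3 purely for illustration: the parent table is named transactions here (TRANSACTION is a reserved word in many databases), every sale, reversal, payment and refund is its own parent row carrying the amount, and the detail tables keep only method-specific attributes. The column choices are assumptions, not a complete accounting model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per money movement; the amount lives only here.
CREATE TABLE transactions (
    id     INTEGER PRIMARY KEY,
    amount INTEGER NOT NULL,
    method TEXT    NOT NULL
);
-- Detail tables hold method-specific attributes, one row per transaction.
CREATE TABLE transaction_credit (
    transaction_id INTEGER PRIMARY KEY REFERENCES transactions(id),
    type           TEXT NOT NULL
);
CREATE TABLE transaction_cash (
    transaction_id INTEGER PRIMARY KEY REFERENCES transactions(id),
    type           TEXT NOT NULL
);
INSERT INTO transactions VALUES
    (1, 80, 'credit'), (2, -80, 'credit'),
    (3, 100, 'cash'),  (4, -100, 'cash');
INSERT INTO transaction_credit VALUES (1, 'sale'), (2, 'reversal');
INSERT INTO transaction_cash   VALUES (3, 'payment'), (4, 'refund');
""")

# Totals need only the parent table, whatever the payment method.
totals = dict(
    conn.execute("SELECT method, SUM(amount) FROM transactions GROUP BY method")
)

# Method-specific details are one join away when they are needed.
credit_rows = conn.execute(
    "SELECT t.id, t.amount, c.type FROM transactions t "
    "JOIN transaction_credit c ON c.transaction_id = t.id ORDER BY t.id"
).fetchall()
```

With the amount stored once per parent row, nothing is duplicated between the parent and the sub-tables.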
Yesterday, I was asked the same question by two different people. Their tables have a field that groups records together, like a year or location. Within those groups, they want to have a unique ID that starts at 1 and increments up sequentially. Obviously, you could search for MAX(ID), but if these applications have a lot of traffic, they'd need to lock the entire table to ensure the same ID wasn't returned multiple times. I thought about using sequences but that would mean dynamically creating a sequence for each group.
Example 1:
Records created during the year should increment by one and then restart at 1 at the beginning of the next year.
| Year | ID |
|------|----|
| 2016 | 1 |
| 2016 | 2 |
| 2017 | 1 |
| 2017 | 2 |
| 2017 | 3 |
Example 2:
A company has many locations and they want to generate a unique ID for each customer, combining the location ID with an incrementing ID.
| Site | ID |
|------|----|
| XYZ | 1 |
| ABC | 1 |
| XYZ | 2 |
| XYZ | 3 |
| DEF | 1 |
| ABC | 2 |
One trick that is often under-used is to create a clustered index on Site / ID or Year / ID, but change the order of the ID column to DESC rather than ASC.
This way when you need to scan the CI to get the Next ID value it only needs to check 1 row in the clustered index. I've used this on Multi-Billion Record tables and it runs quite quickly. You can get even better performance by partitioning the table by Site or Year then you'll get the added benefit of partition elimination when you run your MAX(ID) queries.
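A rough sketch of the MAX(ID) + 1 pattern per group, written against SQLite through Python's sqlite3 only so that it runs anywhere. The customers table and next_id function are hypothetical. SQLite has no clustered-index option, so a unique (site, id DESC) index stands in for the clustered index described above, and BEGIN IMMEDIATE stands in for whatever locking the real database would use.

```python
import sqlite3

# isolation_level=None: autocommit mode, transactions are managed explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE customers (site TEXT NOT NULL, id INTEGER NOT NULL)")
# Composite index with id descending, so the highest id per site is found first.
# Being UNIQUE, it also makes a lost race fail instead of duplicating an id.
conn.execute("CREATE UNIQUE INDEX customers_site_id ON customers (site, id DESC)")

def next_id(conn, site):
    """Insert a row for `site` with the next sequential id and return that id."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading MAX(id)
    try:
        conn.execute(
            "INSERT INTO customers (site, id) "
            "SELECT ?, COALESCE(MAX(id), 0) + 1 FROM customers WHERE site = ?",
            (site, site),
        )
        (new_id,) = conn.execute(
            "SELECT MAX(id) FROM customers WHERE site = ?", (site,)
        ).fetchone()
        conn.execute("COMMIT")
        return new_id
    except Exception:
        conn.execute("ROLLBACK")
        raise

# Same arrival order as Example 2.
assigned = [(s, next_id(conn, s)) for s in ["XYZ", "ABC", "XYZ", "XYZ", "DEF", "ABC"]]
```

The lookup and the insert happen inside one transaction, so only writers for the duration of that short transaction are blocked rather than the whole table for a long scan.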
I am pretty new to database development and architecture. My only experience has been in college and now my project requires me to use that knowledge, however my project seems a lot more complicated with many more intricacies than what I studied.
A brief overview: My task is basically to turn paperwork that was previously done by hand into a quick computer application, which I will do in Java, but that's a long way off for now. I know I will need a database set up to accomplish my task, since these reports are frequently edited. The report is a Labor Report. Basically, it shows who was working on a specific job, on what days and for how many hours on those days, as well as their total hours, pay rate, and total amount.
I believe my current problem lies within the fact that it seems like I'm going to have several "many to many" relationships, perhaps even nested, which is what is throwing my head for a spin as I try to organize information into entity relationship diagrams and tables. (I know that there are normally much more measured and organized stages to development but I don't have that experience and I'm essentially a one man team on this)
Contract Personnel will be selected out of a pool of Employees.
A Labor Contract can have 1 to 10 personnel (For sake of space on the final printed version, jobs requiring more laborers will have another Labor Contract.)
Each personnel must have 1 Title (foreman, mechanic, etc.) These titles can change from job to job. Joe Smith can be a mechanic on job A but a foreman on job B.
Each personnel must also have on record the number of hours they worked on each day of the week; and may have overtime and double overtime. (One Labor Record per week).
I am trying to avoid repeated data, or at least keep it to a minimum but I am struggling on figuring out how to do that in this situation. The tricky thing, at least in my mind, is figuring out how to handle the fact that different employees can work several jobs at once, under different titles, and different pay rates, and recording different types of hours (straight time, OT, double OT) on each day of the week.
Can anyone make suggestions?
I hope that I have supplied adequate information and apologize if I didn't or wasn't detailed enough. Please remember to keep in mind I'm a newbie to this type of work.
First thing, take a deep breath! It looks to me like you have a pretty good handle on this, maybe more than you think! This is not at all to try and design your project, and I'm sure you'll have lots of details to deal with, but maybe this will give an idea of how you might face these many many-to-many relationships swimming around in your head.
EMPLOYEES
---------
emp_id
emp_name
emp_address
JOBS
----
job_id
job_description
EMPLOYEE_JOBS
-------------
ej_id -- primary key
emp_id -- fk to employees table
job_id -- fk to jobs table
ej_title -- employee title for this job
ej_rate -- employee pay rate for this job
EMPLOYEE_JOB_HOURS
------------------
ejh_id -- primary key
ej_id -- fk to employee_jobs table
ejh_date
ejh_normal_hours -- hours worked by the employee on this job on this date, etc.
ejh_overtime_hours
ejh_double_overtime_hours
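To show how these four tables fit together, here is a small sketch using Python's sqlite3 that builds them and produces the totals a Labor Report needs. The sample rows, the column types, and the 1.5x and 2x overtime multipliers are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, emp_name TEXT);
CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, job_description TEXT);
CREATE TABLE employee_jobs (
    ej_id    INTEGER PRIMARY KEY,
    emp_id   INTEGER REFERENCES employees(emp_id),
    job_id   INTEGER REFERENCES jobs(job_id),
    ej_title TEXT,
    ej_rate  REAL
);
CREATE TABLE employee_job_hours (
    ejh_id   INTEGER PRIMARY KEY,
    ej_id    INTEGER REFERENCES employee_jobs(ej_id),
    ejh_date TEXT,
    ejh_normal_hours REAL,
    ejh_overtime_hours REAL,
    ejh_double_overtime_hours REAL
);
-- Made-up data: Joe is a mechanic on Job A and a foreman on Job B.
INSERT INTO employees VALUES (1, 'Joe Smith');
INSERT INTO jobs VALUES (1, 'Job A'), (2, 'Job B');
INSERT INTO employee_jobs VALUES
    (1, 1, 1, 'mechanic', 20.0),
    (2, 1, 2, 'foreman', 30.0);
INSERT INTO employee_job_hours VALUES
    (1, 1, '2014-01-06', 8, 2, 0),
    (2, 1, '2014-01-07', 8, 0, 0),
    (3, 2, '2014-01-08', 8, 0, 1);
""")

# Overtime multipliers (1.5x, 2x) are assumptions, not part of the schema.
REPORT_SQL = """
SELECT e.emp_name, j.job_description, ej.ej_title,
       SUM(h.ejh_normal_hours + h.ejh_overtime_hours
           + h.ejh_double_overtime_hours) AS total_hours,
       SUM(ej.ej_rate * (h.ejh_normal_hours
           + 1.5 * h.ejh_overtime_hours
           + 2.0 * h.ejh_double_overtime_hours)) AS total_pay
FROM employee_job_hours h
JOIN employee_jobs ej ON ej.ej_id = h.ej_id
JOIN employees e ON e.emp_id = ej.emp_id
JOIN jobs j ON j.job_id = ej.job_id
GROUP BY ej.ej_id
ORDER BY ej.ej_id
"""
report = conn.execute(REPORT_SQL).fetchall()
```

Because title and rate live on employee_jobs, the same employee can appear once per job with a different title and rate, without repeating any employee or job data.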
Following is a basic outline you could use to get started. Your final solution will be different based on your exact needs.
You'll need a table to store contract information. My example just shows a description but I'm sure you'll have much more than that.
contracts
id unsigned int(P)
description varchar(50)
+----+-------------+
| id | description |
+----+-------------+
| 1 | Contract A |
| 2 | Contract B |
| .. | ........... |
+----+-------------+
You'll need a table that links contracts and employees and shows what title the employee has for the given contract. In my example you can see that for Contract A, John Q Public is a Foreman and Mary Jane Smith is a Mechanic. For Contract B their titles are reversed: John is a Mechanic and Mary is a Foreman. contract_id and employee_id are foreign keys to their respective tables and together they form the primary key. If it's possible that John and Mary get paid different rates for the same title (for example John gets 25.00/hour as Foreman while Mary gets 20.00/hour), you would add a rate column here instead of using the rate in the titles table.
contracts_employees
contract_id unsigned int(F contracts.id)--\_(P)
employee_id unsigned int(F employees.id)--/
title_id varchar(15)(F titles.id)
+-------------+-------------+----------+
| contract_id | employee_id | title_id |
+-------------+-------------+----------+
| 1 | 1 | Foreman |
| 1 | 2 | Mechanic |
| 2 | 1 | Mechanic |
| 2 | 2 | Foreman |
| ........... | ........... | ........ |
+-------------+-------------+----------+
You'll need a table for employees (you could call this personnel if you prefer). You'll probably store a lot more than just their names...
employees
id unsigned int(P)
first_name varchar(30)
middle_name varchar(30)
last_name varchar(30)
...
+----+------------+-------------+-----------+-----+
| id | first_name | middle_name | last_name | ... |
+----+------------+-------------+-----------+-----+
| 1 | John | Quincy | Public | ... |
| 2 | Mary | Jane | Smith | ... |
| .. | .......... | ........... | ......... | ... |
+----+------------+-------------+-----------+-----+
You'll need a table to track hours worked. I just store a beginning and ending date/time, leaving it up to the application to calculate elapsed time. Your application will also need to ensure there is no overlap for employees - an employee should not be able to work on more than one contract at any given time. Calculation of overtime and double overtime hours is also up to your application. If an employee's pay rate can change at any time (i.e. in the middle of a contract), you would want to store the pay rate in this table instead of using the rate from contracts_employees or titles.
hours
id unsigned int(P)
contract_id unsigned int(F contracts.id)
employee_id unsigned int(F employees.id)
beg datetime
end datetime
+----+-------------+-------------+---------------------+---------------------+
| id | contract_id | employee_id | beg | end |
+----+-------------+-------------+---------------------+---------------------+
| 1 | 1 | 1 | 2014-01-01 08:00:00 | 2014-01-01 17:00:00 |
| 2 | 1 | 2 | 2014-01-01 09:00:00 | 2014-01-01 17:30:00 |
| 3 | 1 | 1 | 2014-01-02 09:00:00 | 2014-01-02 10:00:00 |
| 4 | 1 | 2 | 2014-01-02 08:00:00 | 2014-01-02 09:00:00 |
| 5 | 2 | 1 | 2014-01-02 10:00:00 | 2014-01-02 17:30:00 |
| 6 | 2 | 2 | 2014-01-02 09:00:00 | 2014-01-02 15:00:00 |
| .. | ........... | ........... | ................... | ................... |
+----+-------------+-------------+---------------------+---------------------+
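The application-side checks mentioned for the hours table could look roughly like the following Python sketch. The function names are invented for illustration, and it treats intervals as half-open, so a shift that ends at 10:00 does not clash with one that starts at 10:00; that boundary rule is an assumption.

```python
from datetime import datetime
from itertools import combinations

FMT = "%Y-%m-%d %H:%M:%S"

def parse(ts):
    return datetime.strptime(ts, FMT)

def elapsed_hours(beg, end):
    """Elapsed time between two datetime strings, in hours."""
    return (parse(end) - parse(beg)).total_seconds() / 3600

def find_conflicts(rows):
    """Return (id, id) pairs of rows where one employee's times overlap."""
    by_employee = {}
    for row_id, _contract_id, employee_id, beg, end in rows:
        by_employee.setdefault(employee_id, []).append((parse(beg), parse(end), row_id))
    conflicts = []
    for intervals in by_employee.values():
        # Compare every pair of one employee's intervals, earliest first.
        for (b1, e1, id1), (b2, e2, id2) in combinations(sorted(intervals), 2):
            if b1 < e2 and b2 < e1:  # half-open overlap test
                conflicts.append((id1, id2))
    return conflicts

# The six sample rows from the hours table: (id, contract_id, employee_id, beg, end).
hours = [
    (1, 1, 1, "2014-01-01 08:00:00", "2014-01-01 17:00:00"),
    (2, 1, 2, "2014-01-01 09:00:00", "2014-01-01 17:30:00"),
    (3, 1, 1, "2014-01-02 09:00:00", "2014-01-02 10:00:00"),
    (4, 1, 2, "2014-01-02 08:00:00", "2014-01-02 09:00:00"),
    (5, 2, 1, "2014-01-02 10:00:00", "2014-01-02 17:30:00"),
    (6, 2, 2, "2014-01-02 09:00:00", "2014-01-02 15:00:00"),
]
first_row_hours = elapsed_hours(hours[0][3], hours[0][4])
conflicts = find_conflicts(hours)
```

A check like this would typically run before inserting a new row, rejecting it if it conflicts with any existing row for the same employee.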
And finally a table to store titles and their related pay rates. If employees can be paid different rates for the same title, you wouldn't need the rate column here, instead you would use the rate stored in the contracts_employees table.
titles
id varchar(15)(P)
rate double
+----------+-------+
| id | rate |
+----------+-------+
| Foreman | 20.00 |
| Mechanic | 15.00 |
| ........ | ..... |
+----------+-------+
I need to regularly import large (hundreds of thousands of lines) tsv files into multiple related SQL Server 2008 R2 tables.
The input file looks something like this (it's actually even more complex and the data is of a different nature, but what I have here is analogous):
January_1_Lunch.tsv
+-------+----------+-------------+---------+
| Diner | Beverage | Food | Dessert |
+-------+----------+-------------+---------+
| Nancy | coffee | salad_steak | pie |
| Joe | milk | soup_steak | cake |
| Pat | coffee | soup_tofu | pie |
+-------+----------+-------------+---------+
Notice that one column contains a character-delimited list that needs preprocessing to split it up.
The schema is highly normalized -- each record has multiple many-to-many foreign key relationships. Nothing too unusual here...
Meals
+----+-----------------+
| id | name |
+----+-----------------+
| 1 | January_1_Lunch |
+----+-----------------+
Beverages
+----+--------+
| id | name |
+----+--------+
| 1 | coffee |
| 2 | milk |
+----+--------+
Food
+----+-------+
| id | name |
+----+-------+
| 1 | salad |
| 2 | soup |
| 3 | steak |
| 4 | tofu |
+----+-------+
Desserts
+----+------+
| id | name |
+----+------+
| 1 | pie |
| 2 | cake |
+----+------+
Each input column is ultimately destined for a separate table.
This might seem an unnecessarily complex schema -- why not just have a single table that matches the input? But consider that a diner may come into the restaurant and order only a drink or a dessert, in which case there would be many null rows. Considering that this DB will ultimately store hundreds of millions of records, that seems like a poor use of storage. I also want to be able to generate reports for just beverages, just desserts, etc., and I figure those will perform much better with separate tables.
The orders are tracked in relationship tables like this:
BeverageOrders
+--------+---------+------------+
| mealId | dinerId | beverageId |
+--------+---------+------------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 1 |
+--------+---------+------------+
FoodOrders
+--------+---------+--------+
| mealId | dinerId | foodId |
+--------+---------+--------+
| 1 | 1 | 1 |
| 1 | 1 | 3 |
| 1 | 2 | 2 |
| 1 | 2 | 3 |
| 1 | 3 | 2 |
| 1 | 3 | 4 |
+--------+---------+--------+
DessertOrders
+--------+---------+-----------+
| mealId | dinerId | dessertId |
+--------+---------+-----------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 1 |
+--------+---------+-----------+
Note that there are more records for Food because the input contained those nasty little lists that were split into multiple records. This is another reason it helps to have separate tables.
So the question is, what's the most efficient way to get the data from the file into the schema you see above?
Approaches I've considered:
Parse the tsv file line-by-line, performing the inserts as I go. Whether using an ORM or not, this seems like a lot of trips to the database and would be very slow.
Parse the tsv file to data structures in memory, or multiple files on disk, that correspond to the schema. Then use SqlBulkCopy to import each one. While it's fewer transactions, it seems more expensive than simply performing lots of inserts, due to having to either cache a lot of data or perform many writes to disk.
Per How do I bulk insert two datatables that have an Identity relationship and Best practices for inserting/updating large amount of data in SQL Server 2008, import the tsv file into a staging table, then merge into the schema, using DB functions to do the preprocessing. This seems like the best option, but I'd think the validation and preprocessing could be done more efficiently in C# or really anything else.
Are there any other possibilities out there?
The schema is still under development so I can revise it if that ends up being the sticking point.
You can import your file into a table with the following structure: Diner, Beverage, Food, Dessert, ID (identity, primary key NOT CLUSTERED, for performance reasons).
After this, simply add the following columns: Diner_ID, Beverage_ID, Dessert_ID, and fill them from your separate tables. (It's simple to group by each of the columns, add the missing values to the lookup tables such as Beverages, Desserts and Meals, and then update the imported table with the IDs of both the existing and the newly added records.)
The situation with the Food table is more complex because foods can be combined, but the same trick can be used: add the data to your lookup table and, alongside this, store the food combinations in an additional temp table (with a unique ID) together with their separation into single dishes.
When the parsing is finished, you will have 3 temp tables:
table with all your imported data and IDs for all text columns
table with distinct food lists (with IDs)
table with IDs of food per food combination
From the above tables you can insert the parsed values into whatever structure you want.
In this case only 1 (bulk) insert will be done to the DB from the code side. All other data manipulation will be performed in the DB.
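The staging-table idea can be sketched as follows, using Python's sqlite3 in place of SQL Server so the example is runnable. The table and column names (staging, diners, beverage_orders and so on) are illustrative, the diners lookup is an assumption since the question does not show one, and the Food list splitting is left out because it needs a database-side split step that is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging (id INTEGER PRIMARY KEY,
                      diner TEXT, beverage TEXT, food TEXT, dessert TEXT);
CREATE TABLE meals     (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE diners    (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE beverages (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE desserts  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE beverage_orders (meal_id INTEGER, diner_id INTEGER, beverage_id INTEGER);
CREATE TABLE dessert_orders  (meal_id INTEGER, diner_id INTEGER, dessert_id INTEGER);
""")

# The single bulk load from the code side (SqlBulkCopy in the real setup).
tsv_rows = [
    ("Nancy", "coffee", "salad_steak", "pie"),
    ("Joe", "milk", "soup_steak", "cake"),
    ("Pat", "coffee", "soup_tofu", "pie"),
]
conn.executemany(
    "INSERT INTO staging (diner, beverage, food, dessert) VALUES (?, ?, ?, ?)",
    tsv_rows,
)
meal_id = conn.execute(
    "INSERT INTO meals (name) VALUES (?)", ("January_1_Lunch",)
).lastrowid

# Everything below runs inside the database.
for column, lookup in [("diner", "diners"), ("beverage", "beverages"), ("dessert", "desserts")]:
    # Add values missing from the lookup table, in first-seen order.
    conn.execute(
        f"INSERT INTO {lookup} (name) SELECT {column} FROM staging "
        f"WHERE {column} NOT IN (SELECT name FROM {lookup}) "
        f"GROUP BY {column} ORDER BY MIN(id)"
    )
    # Add the ID column to the staging table and fill it from the lookup table.
    conn.execute(f"ALTER TABLE staging ADD COLUMN {column}_id INTEGER")
    conn.execute(
        f"UPDATE staging SET {column}_id = "
        f"(SELECT id FROM {lookup} WHERE name = staging.{column})"
    )

conn.execute(
    "INSERT INTO beverage_orders SELECT ?, diner_id, beverage_id FROM staging ORDER BY id",
    (meal_id,),
)
conn.execute(
    "INSERT INTO dessert_orders SELECT ?, diner_id, dessert_id FROM staging ORDER BY id",
    (meal_id,),
)
beverage_orders = conn.execute("SELECT * FROM beverage_orders").fetchall()
dessert_orders = conn.execute("SELECT * FROM dessert_orders").fetchall()
```

On the sample file this reproduces the BeverageOrders and DessertOrders rows shown in the question; FoodOrders would follow the same pattern once each food list is split into single dishes.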