I have two source tables: one is basically an invoice, the other is a migrated invoice. The same object should probably have been used for both, but this is what I have instead. They contain mostly the same data.
I had thought to combine both into a single dimension table; however, both use the same natural keys. How should I approach this?
One potential solution I thought of was using negative numbers for the migrated table, but then the natural keys won't align exactly with the source.
Do I just combine them in the fact table? Then I can't link back to the dimension table for either due to NULLs.
Or do I add an additional column or information to indicate which type of invoice it is?
EDIT
Simple models of the current tables below.
The dimension currently only contains the non-migrated data and it has a primary key. However, if I merge the migrated invoice table into it, it will appear as if the changes are being made to the original invoices rather than to a second set of invoices.
Dimension
surrogate_key | source_pk | Total | scd_from   | scd_to
--------------+-----------+-------+------------+------------
1             | 1         | 100   | 01/01/2019 | 31/01/2019
2             | 1         | 150   | 01/02/2019 | 31/12/2019
3             | 2         | 50    | 01/01/2019 | 31/12/9999
source invoice table
pk | Total
___________________
1 | 150
2 | 50
source migrated invoice table
pk | total
___________________
1 | 200
2 | 300
If the invoice and the migrated invoice have the same natural key but some of the fields have different values (your example shows the Total differing between them), then you have one row per natural key in the Dim but two different columns to represent the two sources. Based on your example, you need invoice_Total and migrated_invoice_Total columns in your DIM.
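As a rough sketch only (the staging table names src_invoice and src_migrated_invoice are hypothetical), the load for such a dimension could full-outer-join the two sources on the shared natural key, so each key produces one row with a Total column per source:

SELECT
    COALESCE(i.pk, m.pk) AS source_pk,
    i.Total              AS invoice_Total,
    m.Total              AS migrated_invoice_Total
FROM src_invoice i
FULL OUTER JOIN src_migrated_invoice m
    ON m.pk = i.pk;

With your sample data this would produce source_pk 1 with invoice_Total 150 and migrated_invoice_Total 200, and source_pk 2 with 50 and 300.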
Related
I have a fact table that details individual line amounts for orders placed by my organisation. In this fact, at line level, I've included the total order amount, as it's possible we might need that level of detail at some point.
Here's an example of what I've got:-
+------------+------------+---------------+------------+---------------------+
| BookingKey | Booking_ID | Category_FKey | Line_Value | Total_Booking_Value |
+------------+------------+---------------+------------+---------------------+
| 1 | 12 | 8 | 150 | 700 |
| 2 | 12 | 4 | 150 | 700 |
| 3 | 12 | 5 | 300 | 700 |
| 4 | 12 | 4 | 100 | 700 |
+------------+------------+---------------+------------+---------------------+
As you can see, the Total_Booking_Value here is the sum of the Line_Value for the booking in the example (Booking_ID = 12).
The Category_FKey looks up to a Categories dimension.
Using this structure I've created a simple cube, and it mostly works fine.
The issue I have is that I'd like to be able to view the Total Line_Value amount, and somehow include the Total_Booking_Value alongside it.
So, for example I might add the Categories dimension as a filter and want to filter by say Category_FKey = 4.
If this was the case I'd want the aggregates to tell me that the total Line_Value was 250 (for BookingKeys 2 and 4), and the Total_Booking_Value should be 700. Using normal aggregation (ie SUM) I'm getting the Total_Booking_Value as 1400 (obviously - because it's adding 700 * 2 for the two rows the cube would return).
So, the way I see it I'd like to create an MDX calculation that somehow takes the Total_Booking_Value and gives just the value for the Booking in question.
Should this be done using some kind of average, or division by the Distinct number of items? I can't figure this out. I tried something like this:-
create member currentcube.measures.[Calculated Booking Value]
as
[Measures].[Total_Booking_Value] / count(Measures.Booking_ID);
But this isn't working.
Hopefully this makes sense and you can point me in the right direction.
I find it strange that booking_ID is a measure - intuitively it strikes me as something that would be an attribute and therefore a hierarchy - in which case you'd be able to do the count like this:
[Measures].[Total_Booking_Value]
/
COUNT(EXISTING [Booking].[Booking_ID].[Booking_ID].members)
A straightforward solution would be to have two fact tables: one with granularity booking key and one with granularity booking id. The first would contain all columns except total booking value, and the second would contain columns booking id and total booking value.
Then each of both measures would easily be summable.
The reference type between the second fact table and the category dimension could be configured as many-to-many via the first fact table. Thus, you would see the full values of the involved bookings for each selected category, automatically eliminating double counting.
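A minimal sketch of that split, using the column names from the question (the table names and types are assumptions): one fact table at booking-line grain and one at booking grain.

CREATE TABLE FactBookingLine (
    BookingKey    INT PRIMARY KEY,
    Booking_ID    INT NOT NULL,
    Category_FKey INT NOT NULL,
    Line_Value    DECIMAL(18, 2) NOT NULL
);

CREATE TABLE FactBookingTotal (
    Booking_ID          INT PRIMARY KEY,
    Total_Booking_Value DECIMAL(18, 2) NOT NULL
);

Summing Line_Value from the first table and Total_Booking_Value from the second then no longer double-counts the booking total once per line.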
I have to optimize my little-big database because it's too slow; maybe we'll find another solution together.
First of all, let's talk about the data stored in the database. There are two objects: users and, let's say, messages.
Users
There is something like that:
+----+---------+-------+-----+
| id | user_id | login | etc |
+----+---------+-------+-----+
| 1 | 100001 | A | ....|
| 2 | 100002 | B | ....|
| 3 | 100003 | C | ....|
|... | ...... | ... | ....|
+----+---------+-------+-----+
There is no problem with this table. (Don't worry about id and user_id; user_id is used by another application, so it has to be here.)
Messages
The second table is the one with a problem. Each user has messages, for example like this:
+----+---------+------+----+
| id | user_id | from | to |
+----+---------+------+----+
| 1 | 1 | aab | bbc|
| 2 | 2 | vfd | gfg|
| 3 | 1 | aab | bbc|
| 4 | 1 | fge | gfg|
| 5 | 3 | aab | gdf|
|... | ...... | ... | ...|
+----+---------+------+----+
There is no need to edit messages, but there should be a way to update the list of messages for a user. For example, an external service sends all of a user's messages to the DB, and the list has to be updated.
And the most important thing is that there are about 30 million users, and the average user has 500+ messages. Another problem is that I have to search through the from field and count the number of matches. I designed a simple SQL query with a join, but it takes too much time to get the data.
So... it's quite a big amount of data. I decided not to use RDS (I used PostgreSQL) and decided to move to databases like ClickHouse and so on.
However, I ran into a problem: ClickHouse, for example, doesn't support the UPDATE statement.
To resolve this issue, I decided to store all of a user's messages as one row. So the Messages table should look like this:
Here I'd like to store the messages in JSON format:
{"from":"aaa", "to":"bbe"}
{"from":"ret", "to":"fdd"}
{"from":"gfd", "to":"dgf"}
||
\/
+----+---------+----------+------+ And there I'd like to store the
| id | user_id | messages | hash | <= hash of the messages.
+----+---------+----------+------+
I think that full-text search inside the messages column will save some time and resources.
Do you have any ideas? :)
In ClickHouse, the optimal way is to store data in a "big flat table".
So, you store every message in a separate row.
15 billion rows is OK for ClickHouse, even on a single node.
Also, it's reasonable to have each user's attributes directly in the messages table (pre-joined), so you don't need to do JOINs. This is suitable if the user attributes are not updated.
These attributes will have repeated values for each of a user's messages; that's OK because ClickHouse compresses data well, especially repeated values.
If user attributes are updated, consider storing the users table in a separate database and using the 'external dictionaries' feature to join it.
If a message is updated, just don't update it. Write another row with the modified message to the table instead and leave the old message as is.
It's important to have the right primary key for your table. You should use a table from the MergeTree family, which constantly reorders data by primary key and so maintains the efficiency of range queries. The primary key is not required to be unique; for example, you could define the primary key as just (from) if you would frequently query on from = ... and these queries must be processed quickly.
You could also use user_id as the primary key if queries by user id are frequent and must be processed as fast as possible, but then queries with a predicate on from will scan the whole table (mind that ClickHouse does full scans efficiently).
If you need fast lookups by many different attributes, you can simply duplicate the table with different primary keys. Typically the table will compress well enough that you can afford to keep the data in a few copies with a different sort order for different range queries.
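A minimal sketch of such a flat MergeTree table (column names are assumptions; from/to are renamed msg_from/msg_to to avoid reserved words), with the sort key chosen for per-user queries:

CREATE TABLE messages
(
    user_id    UInt64,
    login      String,      -- pre-joined user attribute
    msg_from   String,
    msg_to     String,
    created_at DateTime
)
ENGINE = MergeTree()
ORDER BY (user_id, created_at);

-- For fast lookups by sender, a second copy of the same data could be kept
-- in another MergeTree table with ORDER BY (msg_from, created_at).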
First of all, when we have such a big dataset, from and to columns should be integers, if possible, as their comparison is faster.
Second, you should consider creating proper indexes. As each user has relatively few records (500 compared to 30M in total), it should give you a huge performance benefit.
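For example (a sketch only, reusing the hypothetical messages table and columns above; from must be quoted because it is a reserved word), a composite index supports the per-user search on the sender:

CREATE INDEX idx_messages_user_from ON messages (user_id, "from");

-- Counting matches for one user then only touches that user's ~500 rows:
SELECT count(*)
FROM messages
WHERE user_id = 1
  AND "from" = 'aab';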
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would be dynamic and would hinder first-time inserts immensely, so I would consider them only as a last, if very effective, resort.
Context: simple webapp game for personal learning purposes, using postgres. I can design it however I want.
2 tables, 1 view (there are additional tables the view references that aren't important)
Table: Research
col: research_id (foreign key to an outside table)
col: category (integer foreign key to category table)
col: percent (integer)
constraint (unique combination of the three columns)
Table: Category
col: category_id (primary key auto inc)
col: name(varchar(255))
notes: this table exists to capture the 4 categories of research I want in business logic, which I assume is not best practice to hardcode as columns in the db
View: Research_view
col: research_id (from research table)
col: foo1 (one of the categories from category table)
col: foo2 (etc...)
col: other cols from other joins
notes: has insert/update/delete statements that use the above tables appropriately
I worry that the research table itself qualifies as a "Skinny Table" (I hadn't heard the term until I just saw it in the iBATIS Manning book). For example, the test data within it looks like:
| research_id | percent | category |
|-------------|---------|----------|
| 1           | 25      | 1        |
| 1           | 25      | 2        |
| 1           | 25      | 3        |
| 1           | 25      | 4        |
| 2           | 20      | 1        |
| 2           | 30      | 2        |
| 2           | 25      | 3        |
| 2           | 25      | 4        |
1) Does it make sense to have all columns in a table collectively define unique entries?
2) Does this 'smell' to you?
Couple of notes to start:
constraint (unique combination of the three columns)
It makes no sense to have a unique constraint that includes a single-column primary key. Including that column will cause every row to be unique.
notes: this table exists to capture the 4 categories of research I want in business logic and which I assume is not best practice to hardcode as columns in the db
If a research item/entity is required to have all four categories defined for it to be valid, they should absolutely be columns in the research table. I can't tell definitively from your statement whether this is the case or not, but your assumption is faulty if looked at in isolation. Let your model reflect reality as closely as possible.
Another factor is whether it's a requirement that additional categories may be added to the system post-deployment. Whether the categories are intended to be flexible vs. fixed should absolutely influence the design.
1) Does it make sense to have all columns in a table collectively define unique entries?
I would say it's not common, but I can imagine there are situations where it might be appropriate.
2) Does this 'smell' to you?
Hard to say without more details.
All that said, if the intent is to view and add research items with all four categories, I would say (again) that you should consider whether the four categories are semantically attributes of the research entity.
As a random example, things like height and weight might be considered categories of a person, but they would likely be stored flat on the person table, and not in a separate table.
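If the four categories really are fixed attributes of the research entity, a flattened design might look like this sketch (table and column names are hypothetical, and the CHECK assumes the four percentages must sum to 100, as in your test data):

CREATE TABLE research (
    research_id  INTEGER PRIMARY KEY,   -- also the foreign key to the outside table
    foo1_percent INTEGER NOT NULL,
    foo2_percent INTEGER NOT NULL,
    foo3_percent INTEGER NOT NULL,
    foo4_percent INTEGER NOT NULL,
    CHECK (foo1_percent + foo2_percent + foo3_percent + foo4_percent = 100)
);

This removes the need for the view's pivot, at the cost of a schema change if a fifth category ever appears.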
Imagine the following: there is a "recipe" table and a "recipe-step" table. The idea is to allow different recipe-steps to be reused in different recipes. The problem I'm having relates to the fact that in the recipe context, the order in which the recipe-steps show up is important, even if it does not follow the recipe-step table primary-key order, because this order will be set by the user.
I was thinking of doing something like:
recipe-step table:
id | stepName | stepDescription
-------------------------------
1 | step1 | description1
2 | step2 | description2
3 | step3 | description3
...
recipe table:
recipeId | step
---------------
1 | 1
1 | 2
1 | 3
...
This way, the order in which the steps show up in the step column is the order I need to maintain.
My concerns with this approach are:
if I have to add a new step between two existing steps, how do I query it? What if I just need to switch the order of two steps already in the sequence?
how do I make sure the order maintains its consistency? If I just insert or update something in the recipe table, it will pop up at the end of the table, right?
Is there any other way you would think of doing this? I also thought of having a previous-step and a next-step column in the recipe-step table, but I think it would be more difficult to make the recipe-steps reusable that way.
In SQL, tables are not ordered.
Unless you are using an ORDER BY clause, database engines are allowed to return records in any order they feel is fastest (for example, a covering index might have the data in a different order, and sometimes even SQLite creates temporary covering indexes automatically).
If the steps have a specific order in a specific recipe, then you have to store this information in the database.
I'd suggest to add this to the recipe table:
recipeId | step | stepOrder
---------------------------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 4 | 1
2 | 2 | 2
Note:
The recipe table stores the relationship between recipes and steps, so it should be called recipe-step.
The recipe-step table is independent of recipes, so it should be called step.
You probably need a table that stores recipe information that is independent of steps; this table should be called recipe.
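A minimal sketch of the renamed schema in SQL (types and constraints are assumptions; the UNIQUE on (recipeId, stepOrder) is optional, and if you keep it, swapping two steps means deferring the constraint or going through a temporary value):

CREATE TABLE recipe (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE step (
    id              INTEGER PRIMARY KEY,
    stepName        TEXT NOT NULL,
    stepDescription TEXT NOT NULL
);

CREATE TABLE recipe_step (
    recipeId  INTEGER NOT NULL REFERENCES recipe (id),
    stepId    INTEGER NOT NULL REFERENCES step (id),
    stepOrder INTEGER NOT NULL,
    PRIMARY KEY (recipeId, stepId),
    UNIQUE (recipeId, stepOrder)
);

-- Steps of recipe 1, in the order the user set:
SELECT s.stepName, s.stepDescription
FROM recipe_step rs
JOIN step s ON s.id = rs.stepId
WHERE rs.recipeId = 1
ORDER BY rs.stepOrder;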
I have to generate a view that shows tracking across each month. The ultimate view will be something like this:
| Person | Task | Jan | Feb | Mar| Apr | May | June . . .
| Joe | Roof Work | 100% | 50% | 50% | 25% |
| Joe | Basement Work | 0% | 50% | 50% | 75% |
| Tom | Basement Work | 100% | 100% | 100% | 100% |
I already have the following tables:
Person
Task
I am now creating a new table to foreign key into the above 2 tables, and I am trying to figure out the pros and cons of creating 1 or 2 tables.
Option 1:
Create a new table with the following Columns:
Id
PersonId
TaskId
Jan2012
Feb2012
Mar2012
Apr2012
or
Option 2:
have 2 separate tables
One table for just
Id
PersonId
TaskId
and another table for just the following columns
Id
PersonTaskId (the id from table above)
MonthYearKey
MonthYearValue
So an example record would be
| 1 | 13 | Jan2011 | 100% |
where 13 would represent a specific, unique Person and Task combination. This second way would avoid having to create new columns as time goes on (which seems right), but I also want to avoid overkill.
Which would be the more scalable way to structure this schema? Also, any other suggestions or more elegant ways of doing this would be great as well.
You can have an m2m table with data columns. I don't see a reason why you can't just put MonthYearKey and MonthYearValue on the same table as PersonId and TaskId:
Id
TaskId
PersonId
MonthYearKey
MonthYearValue
It's also possible that you would want to move MonthYearKey out into its own table; it really just comes down to your common queries and what this data is used for.
I would note that you never want to design a schema where you are adding columns due to time. The first option would require constant maintenance and would also become very difficult to query.
Option 2 is definitely more scalable and is not overkill.
Option 1 would require you to add a new column every month, and simple date-based queries of your data would not be possible, e.g. "Show me all people who worked at least 90% in any month last year."
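For instance, with Option 2 that query becomes a simple filter and join (the table names PersonTask and PersonTaskMonth are assumptions, and the month keys are assumed to look like 'Jan2012'):

SELECT DISTINCT pt.PersonId
FROM PersonTask pt                    -- Option 2, first table: Id, PersonId, TaskId
JOIN PersonTaskMonth ptm              -- Option 2, second table: Id, PersonTaskId, MonthYearKey, MonthYearValue
  ON ptm.PersonTaskId = pt.Id
WHERE ptm.MonthYearValue >= 90
  AND ptm.MonthYearKey LIKE '%2012';  -- "last year", assuming 2012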
The ultimate view would be generated from a particular query or view of your data.
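As a sketch of that, the month columns in the view can be produced by conditional aggregation over the Option 2 tables (same hypothetical names as above; join to Person and Task for the display names):

SELECT
    pt.PersonId,
    pt.TaskId,
    MAX(CASE WHEN ptm.MonthYearKey = 'Jan2012' THEN ptm.MonthYearValue END) AS Jan,
    MAX(CASE WHEN ptm.MonthYearKey = 'Feb2012' THEN ptm.MonthYearValue END) AS Feb,
    MAX(CASE WHEN ptm.MonthYearKey = 'Mar2012' THEN ptm.MonthYearValue END) AS Mar
FROM PersonTask pt
JOIN PersonTaskMonth ptm ON ptm.PersonTaskId = pt.Id
GROUP BY pt.PersonId, pt.TaskId;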