I am working on a star schema and I want to track the history of data for some dimensions, and specifically for certain columns. Is it possible to use temporal tables as an alternative? If yes, how do I store the current record in a temporal table? Also, does it make sense for the source of my dimension to be the history table of my temporal table?
Determining whether two rows or expressions are equal can be a difficult and resource-intensive process. This is often the case with UPDATE statements where the update is conditional on whether all of the columns for a specific row are equal or not.
To address this need in the SQL Server environment, the CHECKSUM function is helpful in your case, as it natively computes a hash expression that can be used to compare two records.
So you will compare your two sources, which are logically the ODS and the data warehouse. If the CHECKSUM values from the two sources aren't the same, you expire the old record and insert the new, updated one.
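As a rough sketch (assuming hypothetical staging and dimension tables stg_Customer and dim_Customer that share the business key CustomerID, plus ValidFrom/ValidTo/IsCurrent tracking columns invented for the example), the comparison could look like this in SQL Server:

-- Expire the current dimension row when the checksum of the tracked columns differs.
UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = SYSDATETIME()
FROM   dim_Customer AS d
JOIN   stg_Customer AS s
       ON s.CustomerID = d.CustomerID
WHERE  d.IsCurrent = 1
  AND  CHECKSUM(s.Name, s.City, s.Segment) <> CHECKSUM(d.Name, d.City, d.Segment);

-- Insert the new version of any business key that no longer has a current record.
INSERT INTO dim_Customer (CustomerID, Name, City, Segment, ValidFrom, ValidTo, IsCurrent)
SELECT s.CustomerID, s.Name, s.City, s.Segment, SYSDATETIME(), NULL, 1
FROM   stg_Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dim_Customer AS d
                   WHERE  d.CustomerID = s.CustomerID
                     AND  d.IsCurrent  = 1);

Keep in mind that CHECKSUM is not collision-free; if false matches are a concern, HASHBYTES over the tracked columns is a safer comparison.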
I want to design a data warehouse (data mart) with one fact table and 2 dimension tables, where the data mart takes some slowly changing dimensions into consideration, with surrogate keys. I'm wondering how I can model this so that data insertion into the dimension tables can be made independent of the fact table (inserted before the fact table row exists). The data will be streamed from PubSub to BigQuery via Dataflow, so some of the dimensional data might arrive earlier and need to be inserted into the dimension table before the fact data.
I don't completely understand your question. Dimensions are always (or rather, almost always) populated before fact tables are, since fact table records refer to dimensions (and not the other way around).
If you're worried about being able to destroy and rebuild your dimension table without having to also rebuild your fact table, then you'll need to use some sort of surrogate key pipeline to maintain your surrogate key to natural key relationships. But again, I'm not sure that this is what you're asking.
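For illustration only, a surrogate-key pipeline usually boils down to resolving natural keys to surrogate keys at fact load time. The table and column names below (staging_sales, dim_product, product_sk, product_id) are invented for the sketch:

-- Resolve the natural key to the surrogate key while loading the fact table.
INSERT INTO fact_sales (order_id, product_sk, quantity, amount)
SELECT s.order_id,
       COALESCE(d.product_sk, -1) AS product_sk,  -- -1 = hypothetical "unknown member" row
       s.quantity,
       s.amount
FROM   staging_sales AS s
LEFT JOIN dim_product AS d
       ON d.product_id  = s.product_id
      AND d.is_current  = TRUE;

A placeholder member (here -1) also gives fact rows somewhere to point until the matching dimension record has been loaded, which is one common way to decouple the two loads.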
BigQuery does not perform referential integrity checks, which means it will not verify that a parent row exists in the dimension table when inserting a child row into the fact table, and you don't need that in an analytics setup. You can keep appending records to the fact table and the dimension tables independently in BigQuery.
Flatten / denormalise the data and keep the dimension attributes in the fact table: repeated records are not going to be an issue in BigQuery, and you can make use of clustering and partitioning (a sketch of this flattened approach follows at the end of this answer).
Another option, if your dimensions live in an RDBMS, is to upload the dimension tables as files to Cloud Storage (or as rows to Cloud SQL) and join them in Dataflow. In this case you can skip multiple sinks and write a flattened schema to a single BigQuery table sink.
Insertion order does not matter in BigQuery; you can relate event records based on the PubSub message publishing time, the source event time, etc.
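As an illustration of the flattened approach mentioned above, a hypothetical BigQuery table (dataset, table and column names are all invented here) that keeps dimension attributes inline and uses partitioning and clustering might be declared like this:

CREATE TABLE IF NOT EXISTS analytics.sales_events
(
  event_time     TIMESTAMP,
  order_id       STRING,
  customer_id    STRING,
  customer_name  STRING,    -- dimension attribute kept inline (denormalised)
  product_id     STRING,
  product_name   STRING,    -- dimension attribute kept inline (denormalised)
  quantity       INT64,
  amount         NUMERIC
)
PARTITION BY DATE(event_time)          -- prune scans by event date
CLUSTER BY customer_id, product_id;    -- co-locate rows for common filters and joins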
I have been working with Redshift and now testing Snowflake. Both are columnar databases. Everything I have read about this type of databases says that they store the information by column rather than by row, which helps with the massive parallel processing (MPP).
But I have also seen that they are not able to change the order of columns or add a column between existing columns (I don't know about other columnar databases). The only way to add a new column is to append it at the end. If you want to change the order, you need to recreate the table with the new order, drop the old one, and rename the new one (this is called a deep copy). But sometimes this isn't possible because of dependencies or even memory utilization.
I'm more surprised by the fact that this can be done in row-oriented databases but not in columnar ones. Of course, there must be a reason why it isn't a feature yet, but I clearly don't have enough information about it. I thought it would just be a matter of changing the column ordinal in information_schema, but clearly it's not that simple.
Does anyone know the reason for this?
Generally, column ordering within the table is not considered to be a first class attribute. Columns can be retrieved in whatever order you require by listing the names in that order.
Emphasis on column order within a table suggests frequent use of SELECT *. I'd strongly recommend not using SELECT * in columnar databases without an explicit LIMIT clause to minimize the impact.
If column order must be changed you do that in Redshift by creating a new empty version of the table with the columns in the desired order and then using ALTER TABLE APPEND to move the data into the new table very quickly.
https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_APPEND.html
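A minimal sketch of that approach, using an invented table my_table whose columns should end up in the order a, b (and assuming both tables use compatible distribution and sort settings, which ALTER TABLE APPEND requires):

-- New empty table with the columns in the desired order.
CREATE TABLE my_table_reordered (
  a INT,
  b INT
);

-- Moves the rows from my_table into my_table_reordered by reassigning storage blocks,
-- so it is much faster than a CREATE TABLE AS / INSERT deep copy.
ALTER TABLE my_table_reordered APPEND FROM my_table;

-- my_table is now empty; drop it and rename the new table.
DROP TABLE my_table;
ALTER TABLE my_table_reordered RENAME TO my_table;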
The order in which the columns are stored internally cannot be changed without dropping and recreating them.
Your SQL can retrieve the columns in any order you want.
The general requirement to have columns listed in some particular order is for viewing purposes.
You could define a view to be in the desired column order and use the view in the required operation.
-- Table created with the columns in the order B, A.
CREATE OR REPLACE TABLE CO_TEST(B NUMBER, A NUMBER);
INSERT INTO CO_TEST VALUES (1,2),(3,4),(5,6);

SELECT * FROM CO_TEST;     -- returns the columns in stored order: B, A
SELECT A, B FROM CO_TEST;  -- an explicit column list returns A, B

-- A view presents the preferred order without touching the table.
CREATE OR REPLACE VIEW CO_VIEW AS SELECT A, B FROM CO_TEST;
SELECT * FROM CO_VIEW;     -- returns A, B
Creating a view to list the columns in the required order does not disturb the actual table underneath the view, and no resources are wasted on recreating the table.
In some databases (Oracle especially), column order in a table can make a difference in performance: placing NULLable columns at the end of the list helps because of how storage is utilized within the data block.
I have two tables with the same columns: one is used to save the bank's amounts and the other to save the cash desk's. Both might hold a lot of data, so I'm concerned about data retrieval speed. I don't know whether it's better to combine them by adding an extra column to indicate the type of each record, or to create a separate table for each one.
The main question you should be asking is - how am I querying the tables.
If there is no real logical connection between the 2 tables (you don't want to get the rows in the same query), use 2 different tables. The other way around, you would need to hold another column to tell you what type of row you are working on, and that will slow you down and make your queries more complex.
In addition, FKs might be a problem if the same column is an FK to 2 different places.
In addition (2nd), locks might be an issue: if you work on one type you might block the other.
Conclusion: 2 tables, and not just for speed.
In theory you have one unique entity, so you could consider one table for your accounts and another one for your account types. For better performance you could separate these account types onto different filegroups and partitions and create an index on the type FK of the account table. In this scenario you logically have one entity that follows relational theory, while physically your data is separated, so data retrieval would be fast.
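A minimal sketch of that single-entity design, with invented table and column names:

CREATE TABLE AccountType (
    AccountTypeId INT PRIMARY KEY,       -- e.g. 1 = bank, 2 = cash desk
    Name          VARCHAR(50) NOT NULL
);

CREATE TABLE AccountEntry (
    EntryId       BIGINT PRIMARY KEY,
    AccountTypeId INT NOT NULL REFERENCES AccountType (AccountTypeId),
    Amount        DECIMAL(18, 2) NOT NULL,
    EntryDate     DATE NOT NULL
);

-- Index on the type FK so queries filtered to one type stay fast;
-- the table could additionally be partitioned on AccountTypeId.
CREATE INDEX IX_AccountEntry_Type ON AccountEntry (AccountTypeId, EntryDate);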
Reading the paper published on C-Store, I did not understand this part:
Redundant storage of elements of a table in several overlapping projections in different orders, so that a query can be solved using the most advantageous projection
Firstly, I did not understand how it is determined which column(s) count as "redundant" columns in the database table.
Secondly, in reference to the point above, my understanding is that these columns marked as "redundant" don't have to be stored in every projection created on the table. If a query requests such a column, only one of the projections would be needed to answer the query. Am I correct?
I have an Excel spreadsheet I am going to turn into a DB to mine data and build an interactive app. There are about 20 columns and 80,000 records. Practically all records have about half of their column data as NULL, but which columns have data is random for each record.
The options would be to:
Create a more normalized DB with a table for each column and use 20 joins to view all data. I would think the benefits would be a DB with really no NULL values so the size would be smaller. One of the major cons would be more code to update each table from the application side.
Create a flat file with one table that has all columns. I figure this will be easier for the application side to do updates, but will result in a table that has a butt load of empty dataspace.
I don't get why you think updating a normalized db is harder than a flat table. It's very much the other way around.
Think about inserting a relation between a customer and a product (basically an order) in the flat design. You'd have to:
select the row that describes the rest of the data, but has NULLs or something in the product columns
you have to update the product columns
you have to insert a HUGE row into the db
What about the first time? What do you do with the initial nulls? Do you modify your selects to ignore them? What if you want the nulls?
What if you delete the last product? Do you change it into an update and set nulls for just a few columns?
Joins aside, working with a normalized table is trivial by design. You pay for its triviality with performance, that's the actual trade-off.
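To make the trade-off concrete, here is a rough sketch with invented table and column names: adding an order in the normalised model is a single narrow insert, while the flat model has to hunt down and rewrite a wide row.

-- Normalised: the relation is just a new row of its own.
INSERT INTO orders (customer_id, product_id, quantity)
VALUES (42, 7, 3);

-- Flat: find the wide row and overwrite the product columns
-- (or insert a whole new wide row that duplicates all the customer columns).
UPDATE customer_flat
SET    product_id = 7,
       quantity   = 3
WHERE  customer_id = 42
  AND  product_id IS NULL;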
If you are going to be using a relational database, you should normalize your tables, if nothing else in order to ease data maintenance and ensure you don't have duplicate data.
You can investigate the use of a document database for storage instead of a relational database, though it is not the only option.
Generally, normalized databases end up being easier to write code against, as SQL code is designed with normalized tables in mind.
Normalizing doesn't have to be done on all columns, so there's a middle ground between the two options you present. A good rule of thumb is that if you have columns that have values being repeated heavily across records, those can be good candidates for normalizing into one or more separate tables. Putting each column in its own table and joining across them is almost certainly overdoing it.
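As a sketch of that middle ground (all names invented), a heavily repeated status column could move to a lookup table while the sparse free-form columns stay where they are:

CREATE TABLE status_lookup (
    status_id   INT PRIMARY KEY,
    status_name VARCHAR(50) NOT NULL UNIQUE   -- one row per distinct repeated value
);

CREATE TABLE record_main (
    record_id   INT PRIMARY KEY,
    status_id   INT REFERENCES status_lookup (status_id),  -- normalised, heavily repeated
    notes       VARCHAR(2000),                             -- sparse columns left in place
    misc_value  DECIMAL(18, 2)
);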
Don't normalize too much. It's hard to maintain a canonical model as your application grows. Storage is cheap. Don't get fooled into coding headaches because of concerns that were valid 20 years ago. There's no need to go NoSQL unless you need it.