Data warehouse design (BigQuery): loading dimension tables independently of the fact table

I want to design a data warehouse (data mart) with one fact table and two dimension tables, where the data mart takes some Slowly Changing Dimensions into consideration using surrogate keys. I'm wondering how I can model this so that rows can be inserted into the dimension tables independently of the fact table (i.e., before the corresponding fact row exists). The data will be streamed from Pub/Sub to BigQuery via Dataflow, so some of the dimension data might arrive earlier and need to be inserted into its dimension table before the fact data.

I don't completely understand your question. Dimensions are always (or rather, almost always) populated before fact tables, since fact table records refer to dimensions (and not the other way around).
If you're worried about being able to destroy and rebuild your dimension table without having to also rebuild your fact table, then you'll need some sort of surrogate key pipeline to maintain the surrogate-key-to-natural-key relationships. But again, I'm not sure that this is what you're asking.
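As a minimal sketch of what such a surrogate key pipeline can look like (pure Python; the class and key names are made up for illustration): the natural-key-to-surrogate-key mapping is kept outside the dimension table itself, so the dimension can be destroyed and rebuilt while fact rows keep their surrogate keys stable.

```python
class SurrogateKeyPipeline:
    """Maps natural keys to stable surrogate keys (illustrative sketch)."""

    def __init__(self):
        self._map = {}       # natural key -> surrogate key
        self._next_key = 1

    def surrogate_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen
        # before; otherwise assign the next one. Because this mapping lives
        # outside the dimension table, rebuilding the dimension does not
        # break the fact table's foreign keys.
        if natural_key not in self._map:
            self._map[natural_key] = self._next_key
            self._next_key += 1
        return self._map[natural_key]

pipeline = SurrogateKeyPipeline()
fact_row = {"customer_sk": pipeline.surrogate_for("CUST-042"), "amount": 9.99}
```

In a real warehouse this mapping would of course be a persisted table rather than an in-memory dict, but the lookup-or-assign logic is the same.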

BigQuery does not perform referential integrity checks: it will not verify that a parent row exists in a dimension table when inserting a child row into the fact table, and you don't need such checks in a data analytics setup. You can keep appending records to the fact table and the dimension tables independently in BigQuery.
Flatten / denormalise the schema and keep dimension attributes in the fact table. Repeated records are not going to be an issue in BigQuery, and you can make use of clustering and partitioning.
Another option: if you have dimensions in an RDBMS, upload the dimension tables as files to Cloud Storage (or rows to Cloud SQL) and join them in Dataflow. In this case you can skip multiple sinks and write a flattened schema to a single BigQuery table sink.
Insertion order does not matter in BigQuery; you can correlate event records based on Pub/Sub message publish time, source event time, etc.
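A rough sketch of the flattened approach (plain Python with hypothetical field names; in a real pipeline this join would typically happen inside a Dataflow transform, with the dimension data as a side input):

```python
# Dimension attributes keyed by natural key, e.g. loaded from Cloud
# Storage or Cloud SQL as a side input (hypothetical data).
customer_dim = {
    "CUST-042": {"customer_name": "Acme", "customer_tier": "gold"},
}

def flatten_event(event, dim):
    """Denormalise: copy dimension attributes into the fact record."""
    row = dict(event)
    # If the dimension row hasn't arrived yet, emit placeholder attributes.
    # Since insertion order doesn't matter in BigQuery, the row can be
    # reconciled later via the natural key and the event timestamp.
    attrs = dim.get(event["customer_id"],
                    {"customer_name": None, "customer_tier": None})
    row.update(attrs)
    return row

flat = flatten_event({"customer_id": "CUST-042", "amount": 9.99,
                      "event_ts": "2023-01-01T00:00:00Z"}, customer_dim)
```

The resulting flat record can then be written to a single BigQuery table sink instead of maintaining separate fact and dimension sinks.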

Related

How does columnfamily from BigTable in GCP relate to columns in a relational database

I am trying to migrate a table that is currently in a relational database to BigTable.
Let's assume that the table currently has the following structure:
Table: Messages
Columns:
Message_id
Message_text
Message_timestamp
How can I create a similar table in BigTable?
From what I can see in the documentation, BigTable uses ColumnFamily. Is ColumnFamily the equivalent of a column in a relational database?
BigTable is different from a relational database system in many ways.
Regarding database structures, BigTable should be considered a wide-column, NoSQL database.
Basically, every record is represented by a row and for this row you have the ability to provide an arbitrary number of name-value pairs.
This row has the following characteristics.
Row keys
Every row is uniquely identified by a row key, which is similar to a primary key in a relational database. Rows are stored in lexicographic order of the row key, and the row key is the only information indexed in a table.
In the construction of this key you can choose a single field or combine several ones, separated by # or any other delimiter.
The construction of this key is the most important aspect to take into account when designing your tables. You must think about how you will query the information. Among other things, keep in mind the following (always remembering the lexicographic order):
Define prefixes by concatenating fields in a way that lets you fetch information efficiently. BigTable allows you to scan rows whose keys start with a certain prefix.
Relatedly, model your key so that related information (think, for example, of all the messages that come from a certain origin) is stored together and can be fetched more efficiently.
At the same time, define keys in a way that maximizes dispersion and load balancing between the different nodes of your BigTable cluster.
Column families
The information associated with a row is organized in column families. It has no correspondence with any concept in a relational database.
A column family allows you to group several related fields (columns).
You need to define the column families beforehand.
Columns
A column will store the actual values. It is similar in a certain sense to a column in a relational database.
You can have different columns for different rows. BigTable stores information sparsely: if you do not provide a value for a column in a given row, it consumes no space.
BigTable is a three-dimensional database: for every cell, in addition to the actual value, a timestamp is stored as well.
In your use case, you can model your table like this (suppose, for example, that you are also able to identify the origin of the message, and that it is valuable information):
Row key = message_origin#message_timestamp (truncated to the half hour, hour...¹)#message_id
Column family = message_details
Columns = message_text, message_timestamp
This will generate row keys like the following (assuming, for example, that the message was sent from a device with id MT43):
MT43#1330516800#1242635
Please, as @norbjd suggested, see the relevant documentation for an in-depth explanation of these concepts.
One important difference from a relational database to note: BigTable only offers atomic single-row transactions, and only when using single-cluster routing.
¹ See, for instance: How to round unix timestamp up and down to nearest half hour?
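The row key scheme above can be sketched in a few lines of Python (the half-hour truncation is the rounding mentioned in the footnote; the function name is illustrative):

```python
HALF_HOUR = 30 * 60  # seconds

def make_row_key(origin, unix_ts, message_id):
    """Build a BigTable-style row key: origin#truncated_ts#message_id."""
    truncated = unix_ts - (unix_ts % HALF_HOUR)  # round down to half hour
    return f"{origin}#{truncated}#{message_id}"

# A message from device MT43, sent at unix time 1330516800:
key = make_row_key("MT43", 1330516800, 1242635)
# -> "MT43#1330516800#1242635"
```

All messages from the same origin within the same half hour share a key prefix, so they can be fetched together with a single prefix scan.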

Performance improvement by selective updating of Fact table columns in dimensional data warehouse

I'm trying to improve the loading of a dimensional data warehouse. I need to load tens of thousands of records each time I run the loading operation; sometimes I need to load hundreds of thousands of records; semi-annually I need to load millions of records.
For the loading operation, I build an in-memory record corresponding to a 48-column Fact table record which is to be updated or inserted (depending on whether a previous version of the record is already in the Fact table). The in-memory record includes about 2 dozen foreign key pointers to various dimension tables. I have indexes on all of those Fact table foreign keys (and of course I can disable those indexes while loading the data warehouse).
After populating the in-memory record with fresh data for the Fact table record, I add it to the Fact table (update or insert, as above).
I am wondering if I could improve performance in update situations by updating only those columns which have changed, instead of blindly updating every column in the Fact table record. Doing this would add some overhead, since my loading program would become more complex. But would I gain any benefit? Suppose, for example, that one of those dimension table foreign keys hasn't changed: is there a performance benefit if I do not update that particular column? I guess it comes down to whether SQL Server internally tracks foreign key references between the Fact and dimension tables. If it does, then perhaps there is a performance benefit in not updating the column, since SQL Server would not need to perform its internal tracking operation. But if it does not do such tracking, then I suppose I should just update all 48 Fact table columns regardless of whether they have changed or not, since that avoids adding complexity and overhead to my loading program.
I'd appreciate any comments or suggestions.

Implementing temporal tables for dimensions for tracking changes

I am working on a star schema and I want to track the history of the data for some dimensions, and specifically for some columns. Is it possible to use temporal tables as another alternative? If yes, how do I store the current record in a temporal table? Also, does it make sense for the source of my dimension to be the history table of my temporal table?
Determining whether two rows or expressions are equal can be a difficult and resource-intensive process. This can be the case with UPDATE statements where the update is conditional on whether all of the columns are equal for a specific row.
To address this need in the SQL Server environment, the CHECKSUM function is helpful in your case, as it natively creates a hash expression for comparing two records (keep in mind that CHECKSUM values can collide; BINARY_CHECKSUM or HASHBYTES are safer alternatives).
So you will compare your two sources, which are logically the ODS and the data warehouse. If the checksums from the two sources differ, you expire the old record and insert the new, updated one.
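The same change-detection idea can be sketched outside SQL Server (pure Python, using hashlib in place of CHECKSUM; the column names and rows are made up): hash the tracked columns of each row, and only write a new version when the hashes differ.

```python
import hashlib

def row_hash(row, columns):
    """Hash the tracked columns of a row (analogue of CHECKSUM/HASHBYTES)."""
    payload = "|".join(str(row.get(c)) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

TRACKED = ["name", "city"]  # columns whose history we want to track

warehouse_row = {"customer_id": 1, "name": "Acme", "city": "Oslo"}
ods_row = {"customer_id": 1, "name": "Acme", "city": "Bergen"}

if row_hash(ods_row, TRACKED) != row_hash(warehouse_row, TRACKED):
    # SCD2-style change: expire the old record, insert the new version.
    pass
```

Hashing only the tracked columns means changes to columns you don't care about won't trigger a new dimension version.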

Database Design : Is it a good idea to combine two tables with the same columns and too many data?

I have two tables with the same columns: one is used to store the bank's amounts and the other the cash desk's. Both might hold a lot of data, so I'm concerned about data retrieval speed. Is it better to combine them by adding an extra column to indicate the type of each record, or to create a separate table for each?
The main question you should be asking is - how am I querying the tables.
If there is no real logical connection between the 2 tables (you don't want to get the rows in the same query), use 2 different tables; the other way around, you would need to hold another column telling you what type of row you are working with, which will slow you down and make your queries more complex.
In addition, FKs might be a problem if the same column is a FK to 2 different places.
In addition (2nd), locks might be an issue: if you work on one type you might block the other.
Conclusion: 2 tables, and not just for speed.
In theory you have one unique entity, so you could use one table for your accounts and another for your account types. For better performance you could separate these account types into two different filegroups and partitions, and create an index on the type FK of the account table. In this scenario you logically have one entity, as relational theory dictates, while physically your data is separated, so the data retrieval process is fast and beneficial.

When should I use Oracle's Index Organized Table? Or, when shouldn't I?

Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like "regular" tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically an index-organized table is an index without a table. There is a table object which we can find in USER_TABLES but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column then you have a possible candidate for INDEX ORGANIZED.
The main use case for index organized table is a table which is almost always accessed by its primary key and we always want to retrieve all its columns. In practice, index organized tables are most likely to be reference data, code look-up affairs. Application tables are almost always heap organized.
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly if we find ourselves contemplating the need for additional indexes on the non-primary key columns then we're probably better off with a regular heap table. So, as most tables probably need additional indexes most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (i.e. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE, END_DATE are common). Also there is no prevailing access path - we want to find all the students who take a class as often as we want to find all the classes a student is taking - so we need an indexing strategy which supports both equally well. Not saying intersection tables are not a use case for IOTs. just that they are not automatically so.
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many relationships). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you use an IOT?
Small tables can be good candidates for IOTs as discussed by Richard Foote here
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table that you already are going to have an index that covers all the columns anyway (i.e. may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order. This type of table is often used for information retrieval, spatial (see "Overview of Oracle Spatial"), and OLAP applications (see "OLAP").
This question from AskTom may also be of some interest, especially where someone gives a scenario and asks whether an IOT would perform better than a heap organised table. Tom's response is:
we can hypothesize all day long, but until you measure it, you'll never know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations about what other database features can and cannot be used with index-organized tables -- I recall that in at least one version one could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and as you might have spent two or three or more on the IOT/index this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single table hash cluster. When correctly created they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical i/o's to locate the leaf node.
I can't comment on IOTs per se; however, if I'm reading this right, they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if the primary key (or the value(s) you're indexing, if it's not a primary key) is likely to be distributed fairly randomly, as such inserts can result in many page splits (expensive).
Indexes such as identity columns (sequences in Oracle?) and dates 'around the current date' tend to make for good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index organized tables (IOTs) are indexes which actually hold the data being indexed, unlike ordinary indexes, which are stored separately and point to the actual data.