I am new to Snowflake and want to know: can we use hash codes for joining tables, finding unique records, or deleting duplicate records in Snowflake (or in any other database in general)?
I am designing an ETL flow. What are the advantages and disadvantages of using hash codes, and why are they not used more often in data warehousing designs?
If you mean hashing with something like MD5_BINARY or SHA1_BINARY, then yes, absolutely. Binary values are half the byte length of the equivalent hex varchar, so you should use the binary variants. The benefit of hash keys is that you only need a single join column even when the natural key of a table is a composite key. You could instead use a numeric/integer sequence key, but that imposes a load order: for example, only after the related dimension tables have loaded can you build the related fact table (if you are doing that kind of modelling).
Data Vault prefers durable hash keys precisely because they do not impose any load ordering: each table can load independently, in any order.
Anyway, I digress. Yes, hash keys have great advantages; just make sure they are stored as binary data types when loaded.
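As an illustration, here is a minimal Snowflake sketch of both ideas: deriving a binary hash key from a composite natural key, and using it to remove duplicates. The table and column names (stg_customer, source_system, customer_no, load_ts) are assumptions, not from the question:

    CREATE OR REPLACE TABLE dim_customer AS
    SELECT
        -- 16-byte BINARY hash key; the '||' delimiter guards against
        -- collisions such as ('ab','c') vs ('a','bc')
        MD5_BINARY(source_system || '||' || customer_no) AS customer_hk,
        s.*
    FROM stg_customer s
    -- keep only the latest row per natural key, dropping duplicates
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY MD5_BINARY(source_system || '||' || customer_no)
        ORDER BY load_ts DESC) = 1;

The same customer_hk column can then serve as the single join column between the dimension and any related fact table.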
I am creating a dimensional data model for implementation in SAP HANA. In dimensional modeling, having surrogate keys for dimension tables is mandatory; however, I am told that in SAP HANA we cannot define surrogate keys and have to depend on the natural keys of the dimensions. I have never come across this before; in particular, using natural keys for slowly changing dimensions is not possible.
Any suggestion on implementing surrogate keys in Hana will be great.
SAP HANA supports, just like most other RDBMSs, the automatic generation of surrogate (synthetic) keys. The feature is called an IDENTITY column. There are also key-generating functions like SYSUUID available that produce guaranteed globally unique values.
This covers the feature for current databases, i.e. databases that represent only the most current state of information.
For the example you mentioned (slowly changing dimensions, SCD type 2), you need to bring in the concept of the timeframe during which any dimension entry is considered current. In other words, you need a temporal database. One way to do that is to add validFrom/validTo fields to your dimension tables and fill them accordingly during data loading.
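A minimal sketch of such a dimension table, combining the IDENTITY feature mentioned above with the validity columns (table and column names are assumptions):

    CREATE COLUMN TABLE dim_product (
        product_sk BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
        product_nk NVARCHAR(40) NOT NULL,  -- natural (business) key
        valid_from TIMESTAMP NOT NULL,
        valid_to   TIMESTAMP NOT NULL      -- e.g. '9999-12-31' while current
    );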
SAP HANA supports this type of modelling with a feature called temporal join that allows an easy match of fact data to a temporal dimension table.
Considering these features and the fact that SAP's own data warehouse solution SAP BW/4HANA manages slowly changing dimensions on SAP HANA, I'd say the claim you heard is incorrect.
I'm wondering about the best way to set up the keys for a table holding activity stream data. Each activity type will have different attributes (with some common ones). Here is an example of what some items will consist of:
A follow activity:
type
user_id
timestamp
follower_user_id
followee_user_id
A comment activity:
type
user_id
timestamp
comment_id
commenter_user_id
commented_user_id
For displaying the stream I will be querying against user_id and ordering by timestamp. There will also be other types of queries - for example, I will occasionally need to query user_id AND type, as well as fields like comment_id, follower_user_id, etc.
So my questions are:
Should my primary key be a hash and range key using user_id and timestamp?
Do I need secondary indexes for every other item - e.g. comment_id - or will results return quickly enough without them? Secondary indexes are limited to 5, which wouldn't be enough for all the types of queries I will need to perform.
I'd consider whether you could segment the data into two (or more) tables, allowing better targeting of your queries. Combine the two as (and if) needed; i.e., your type becomes your table, rather than a discriminator column as you would use in SQL.
If you don't separate the tables, then my answers would be
Yes - I think that would be the best bet given that it seems like most of the time, that will be the way you are using it.
No. But you do need to consider which queries will be most frequent and the performance requirements around them. Which ones need to be fast, and for which ones is "good enough" good enough?
A combination of caching and asynchronous processing can allow a slow performing scan to be good enough - but it doesn't eliminate the requirement to have some local secondary indexes.
In a simple database design, entity tables have IDs (mostly auto-increment).
But there are some systems, e.g. vtiger CRM, that use a master table to store all newly created IDs.
My question is:
What are the benefits of the described approach?
What is the name of the described approach, if any? I mean, what do designers call this method?
Moodle is another example of this method. An example in Moodle:
mdl_context has all IDs of other modules:
    id  | contextlevel | instanceid | path         | depth
    115 | 50           | 17         | /1/84/90/115 | 4
instanceid is the ID of the other entity, and contextlevel identifies its table; for example, 50 is the code for the course table.
Even without mdl_context, mdl_course has its own ID, so why does mdl_context exist?
One reason: think of the case where your database doesn't support auto-increment columns and you have to implement auto-incrementing values yourself.
Another: due to limitations in a database's specific auto-increment implementation, your business rules may require you to customize key generation.
For example, when gaps in the column values must not happen.
Consider a sales scenario in which you need an exact, gap-free sequence of numbers in a billing_number column. Using an auto-increment approach will cause some problems:
1. If any bill is rejected, you lose a number (rollback scenario).
2. If a DELETE happens on the billing table, you lose a number (delete scenario).
3. In some distributed (clustered) database environments like Oracle RAC (multiple RDBMS nodes), using Oracle sequences as the auto-increment strategy requires a CACHE interval to maintain integrity, so again some numbers will be lost.
In these cases you may use a metadata table like crm_entity that holds the last used value per table (or any other information needed). Locking the metadata table is inevitable, so under a heavy transaction rate there will be a performance cost.
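A minimal sketch of the pattern (table and column names are assumptions; the FOR UPDATE lock is what serializes number assignment):

    CREATE TABLE key_registry (
        table_name   VARCHAR(64) PRIMARY KEY,
        last_used_id BIGINT NOT NULL
    );

    -- In the same transaction as the bill itself, so that a rollback
    -- gives the number back:
    SELECT last_used_id
      FROM key_registry
     WHERE table_name = 'billing'
       FOR UPDATE;              -- blocks concurrent number requests

    UPDATE key_registry
       SET last_used_id = last_used_id + 1
     WHERE table_name = 'billing';

    -- INSERT INTO billing (billing_number, ...) using the new value,
    -- then COMMIT both changes together.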
SQL DBMSs typically provide a key generator feature that can be directly associated with a column in a table, variously known as Identity or auto-incrementing columns. These suffer certain disadvantages, however. The syntax is often highly proprietary and awkward to work with, and the key generator usually comes with built-in limitations, such as not permitting explicit inserts or updates of the key column, or only allowing one such column per table. Table-based generator functions normally only work on insert, which means the value can't be accessed and used until after the row has been inserted, and they are associated with one table only, making it impossible to generate key values that are shared and distributed between tables.
To overcome those and other limitations, table-independent key generators are often used instead. Some DBMSs (Oracle, SQL Server) support this directly with special Sequence-generator objects that are independent of tables but other DBMSs do not. So keeping a sequence-generating table separate from other tables is a useful general way to create sequences without relying on DBMS-specific features.
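For illustration, the sequence route in DBMSs that support it (the sequence name is an assumption):

    CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1;

    -- Oracle:           SELECT order_seq.NEXTVAL FROM dual;
    -- SQL Server 2012+: SELECT NEXT VALUE FOR order_seq;

In DBMSs without sequence objects, the separate sequence-generating table described above plays the same role.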
Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like "regular" tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically an index-organized table is an index without a table. There is a table object which we can find in USER_TABLES but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column then you have a possible candidate for INDEX ORGANIZED.
The main use case for index organized table is a table which is almost always accessed by its primary key and we always want to retrieve all its columns. In practice, index organized tables are most likely to be reference data, code look-up affairs. Application tables are almost always heap organized.
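For reference, the Oracle DDL is just a matter of appending ORGANIZATION INDEX; the look-up table here is an invented example:

    CREATE TABLE country_codes (
        iso_code VARCHAR2(2)  NOT NULL,
        name     VARCHAR2(60) NOT NULL,
        CONSTRAINT country_codes_pk PRIMARY KEY (iso_code)
    )
    ORGANIZATION INDEX;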
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly if we find ourselves contemplating the need for additional indexes on the non-primary-key columns, then we're probably better off with a regular heap table. So, as most tables probably need additional indexes, most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (i.e. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE and END_DATE are common). Also, there is no prevailing access path - we want to find all the students who take a class as often as we want to find all the classes a student takes - so we need an indexing strategy which supports both equally well. I'm not saying intersection tables are not a use case for IOTs, just that they are not automatically so.
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many relationships). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you use an IOT?
Small tables can be good candidates for IOTs, as discussed by Richard Foote.
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table that you already are going to have an index that covers all the columns anyway (i.e. may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order. This type of table is often used for information retrieval, spatial (see "Overview of Oracle Spatial"), and OLAP applications (see "OLAP").
This question from AskTom may also be of some interest, especially where someone gives a scenario and then asks whether an IOT would perform better than a heap-organised table. Tom's response is:
we can hypothesize all day long, but until you measure it, you'll never know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations on which other database features can be used with index-organized tables - I recall that in at least one version you could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and since you might already have spent two, three, or more logical reads descending the IOT's index, this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single-table hash cluster. When correctly created, they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical I/Os to locate the leaf node.
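A sketch of such a cluster in Oracle (the names and sizing numbers are invented for illustration):

    CREATE CLUSTER lookup_cluster (id NUMBER)
        SIZE 512 SINGLE TABLE HASHKEYS 1000;

    CREATE TABLE lookup_codes (
        id   NUMBER PRIMARY KEY,
        name VARCHAR2(60)
    )
    CLUSTER lookup_cluster (id);

SIZE and HASHKEYS must be tuned to the expected row size and key count; a badly sized hash cluster wastes space or causes collisions.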
I can't comment on IOTs per se, but if I'm reading this right then they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if your primary key (or the value(s) you're indexing, if it's not a primary key) is likely to be distributed fairly randomly, as such inserts can result in many page splits (expensive).
Keys such as identity columns (sequences in Oracle?) and dates 'around the current date' tend to make good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index-organized tables (IOTs) are indexes which actually hold the data being indexed, unlike ordinary indexes, which are stored separately and hold pointers to the actual data.
I am currently planning to develop a music streaming application, and I am wondering what would be better as a primary key in my tables on the server: an integer ID or a unique string.
Method 1:
Songs Table:
**SongID** (int), Title (string), *Artist* (string), Length (int), *Album* (string)
Genre Table:
**Genre** (string), Name (string)
SongGenre:
***SongID*** (int), ***Genre*** (string)
Method 2:
Songs Table:
**SongID** (int), Title (string), *ArtistID* (int), Length (int), *AlbumID* (int)
Genre Table:
**GenreID** (int), Name (string)
SongGenre:
***SongID*** (int), ***GenreID*** (int)
Key: **Bold** = Primary Key, *Italic* = Foreign Key
I'm currently designing using Method 2, as I believe it will speed up lookups and use less space, since an int takes a lot less space than a string.
Is there any reason this isn't a good idea? Is there anything I should be aware of?
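For concreteness, Method 2 in SQL DDL might look like this (a sketch; types, sizes, and the Artists/Albums tables are assumptions):

    CREATE TABLE Artists (ArtistID INT PRIMARY KEY, Name  VARCHAR(100));
    CREATE TABLE Albums  (AlbumID  INT PRIMARY KEY, Title VARCHAR(200));

    CREATE TABLE Songs (
        SongID   INT PRIMARY KEY,
        Title    VARCHAR(200) NOT NULL,
        ArtistID INT NOT NULL REFERENCES Artists (ArtistID),
        Length   INT,
        AlbumID  INT REFERENCES Albums (AlbumID)
    );

    CREATE TABLE Genre (
        GenreID INT PRIMARY KEY,
        Name    VARCHAR(100) NOT NULL
    );

    -- the composite primary key doubles as the uniqueness constraint
    CREATE TABLE SongGenre (
        SongID  INT REFERENCES Songs (SongID),
        GenreID INT REFERENCES Genre (GenreID),
        PRIMARY KEY (SongID, GenreID)
    );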
Is there any reason this isn't a good idea? Is there anything I should be aware of?
Yes. Integer IDs are very bad if you need to uniquely identify the same data outside of a single database - for example, if you have to copy the same data into another database system with potentially pre-existing data, or if you have a distributed database. The biggest thing to be aware of is that an integer like 7481 has no meaning outside of that one database. If you later need to grow beyond that database, it may be impossible without surgically re-keying your data.
The other thing to keep in mind is that integer IDs aren't as flexible so they can't easily be used for exceptional cases. The designers of the Internet Protocol understood this and took precautions by allocating certain blocks of numbers as "special" in one way or another (broadcast IPs, private IPs, network IPs). But that was only possible because there's a protocol surrounding the usage of those numbers. Many databases don't operate within such a well-defined protocol.
FWIW, it's kind of like trying to decide if having a "strongly typed" programming paradigm is better than a "weakly/dynamically typed" programming paradigm. It will depend on what you need to do.
You are doing the right thing - the identity field should be numeric, not string-based, both for space savings and for performance reasons (matching keys on strings is slower than matching on integers).
From the software perspective, a GUID is better, as it is unique globally.
Quotes from: Primary Keys: IDs versus GUIDs
Using a GUID as a row identity value feels more natural -- and certainly more truly unique -- than a 32-bit integer. Database guru Joe Celko seems to agree. GUID primary keys are a natural fit for many development scenarios, such as replication, or when you need to generate primary keys outside the database. But it's still a question of balancing the tradeoffs between traditional 4-byte integer IDs and 16-byte GUIDs:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}'
The generated GUIDs should be partially sequential for best performance (e.g. newsequentialid() on SQL 2005) and to enable use of clustered indexes
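That last point in SQL Server terms, as a rough sketch (the table and constraint names are invented):

    CREATE TABLE Songs (
        SongID UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT DF_Songs_SongID DEFAULT NEWSEQUENTIALID(),
        Title  NVARCHAR(200) NOT NULL,
        CONSTRAINT PK_Songs PRIMARY KEY CLUSTERED (SongID)
    );

Because NEWSEQUENTIALID() produces ascending values, new rows append to the end of the clustered index instead of splitting pages the way random NEWID() values would.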
My recommendation is: use IDs.
You'll be able to rename that "Genre" with 20,000 songs without breaking anything. The idea behind this is that the ID identifies the row in the table; whatever else the row contains doesn't matter for this purpose.
This is in large part a matter of personal preference.
My personal opinion and practice is to always use integer keys and to always use surrogate rather than natural keys (so never use anything like social security number or the genre name directly).
There are cases where an auto-number field is not appropriate or does not scale. In these cases it can make sense to use a GUID, which can be stored as a string in databases that do not have a native datatype for it.