I am planning to build some wide tables in snowflake. The underlying data is highly normalized, so there is a lot of joining involved.
To illustrate the point, consider TRANSACTIONS (1 billion records), PRODUCTS (10k records) and PRODUCT_CATEGORY (50 records).
I would look to build:
-- Creating a view in Snowflake
SELECT t.*, p.productName, pc.productCategoryName
FROM TRANSACTIONS t
JOIN PRODUCTS p ON p.product_id = t.product_id
JOIN PRODUCT_CATEGORY pc ON pc.product_category_id = p.product_category_id
My question is whether I should keep product_category_id or product_id in the view? In theory it should be quicker to query based on the (integer) Id rather than on the (string) productName or productCategoryName.
That said, I may be overthinking this. I'm new to snowflake so not 100% sure how much of this is important.
In all likelihood, you have both a natural key (the actual value that has semantic meaning to a user - like a unique product SKU or name), and a surrogate key (the _ID that is a unique integer value for each of your natural key values).
Generally, the surrogate key (assumed as integer) which you base your join upon will perform better than the natural key (assumed as string) - so you should use the surrogate key in the join conditions.
As for utility, I would include both the natural and surrogate key values in the select clause of the view. This allows the view to be used with other tables / views where you can join on the surrogate key there as well.
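A minimal sketch of that advice, using Python's built-in sqlite3 as a stand-in for Snowflake. The view name TRANSACTIONS_WIDE and the columns transaction_id and amount are assumptions for illustration; the other names come from the question. The view joins on the surrogate keys and exposes both key kinds, so callers can filter or join on either.

```python
import sqlite3

# In-memory SQLite stands in for Snowflake here; this is a sketch, not Snowflake DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT_CATEGORY (
    product_category_id INTEGER PRIMARY KEY,
    productCategoryName TEXT UNIQUE
);
CREATE TABLE PRODUCTS (
    product_id INTEGER PRIMARY KEY,
    productName TEXT UNIQUE,
    product_category_id INTEGER REFERENCES PRODUCT_CATEGORY(product_category_id)
);
-- transaction_id and amount are hypothetical columns.
CREATE TABLE TRANSACTIONS (
    transaction_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES PRODUCTS(product_id),
    amount REAL
);

INSERT INTO PRODUCT_CATEGORY VALUES (1, 'Tools'), (2, 'Toys');
INSERT INTO PRODUCTS VALUES (10, 'Hammer', 1), (20, 'Robot', 2);
INSERT INTO TRANSACTIONS VALUES (100, 10, 9.5), (101, 20, 30.0), (102, 10, 12.0);

-- Join on the surrogate keys; select both surrogate and natural keys.
CREATE VIEW TRANSACTIONS_WIDE AS
SELECT t.transaction_id, t.amount,
       p.product_id, p.productName,
       pc.product_category_id, pc.productCategoryName
FROM TRANSACTIONS t
JOIN PRODUCTS p ON p.product_id = t.product_id
JOIN PRODUCT_CATEGORY pc ON pc.product_category_id = p.product_category_id;
""")

# Filtering by surrogate key or by natural key selects the same rows.
by_id = conn.execute(
    "SELECT transaction_id FROM TRANSACTIONS_WIDE "
    "WHERE product_id = 10 ORDER BY transaction_id").fetchall()
by_name = conn.execute(
    "SELECT transaction_id FROM TRANSACTIONS_WIDE "
    "WHERE productName = 'Hammer' ORDER BY transaction_id").fetchall()
print(by_id, by_name)
```

Keeping product_id and product_category_id in the view costs two narrow columns and leaves the choice of filter column to whoever queries it.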
Related
I have a db that supports a free-form addition of values to products. I'm using a key-value structure because individual products can have wildly different key/value pairs associated with them, otherwise I'd just make a column.
Given a table for products and a table for key-value pairs, I want to know what kind of indexes to set up to best support this.
Tables:
Products: productId(pk), name, category
ProductDetails: productId(fk), name, value(text)
Frequently used queries I want to be fast:
SELECT * from ProductDetails pd where pd.productId = NNN
SELECT * from ProductDetails pd where pd.name='advantages' and pd.value like '%forehead laser%'
I'd encourage you to comb through this answer: How to create composite primary key in SQL Server 2008
This all depends on your querying and data constraint needs. You can have a clustered index on just the productId and add multiple non-clustered indexes on other composite keys (ProductDetails.name and .value and/or productId too). These can also enforce the uniqueness of the data being inserted so you don't get duplicates.
Be aware though there are diminishing returns on adding too many indexes on large tables where inserts/updates need to occur as well. The db has to determine where the row should go in relation to each index.
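As a rough sketch of the indexing idea, the snippet below builds the two tables in SQLite (which has no clustered/non-clustered distinction, so it only approximates the SQL Server setup) and asks the planner which index each of the two frequent queries would use. The index names ix_details_product and ix_details_name_value are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (productId INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE ProductDetails (
    productId INTEGER REFERENCES Products(productId),
    name TEXT,
    value TEXT
);
-- One index per frequent access path (hypothetical index names).
CREATE INDEX ix_details_product ON ProductDetails (productId);
CREATE INDEX ix_details_name_value ON ProductDetails (name, value);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN returns 4-column rows; the last column is the readable detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_by_product = plan("SELECT * FROM ProductDetails pd WHERE pd.productId = 42")
plan_by_name = plan(
    "SELECT * FROM ProductDetails pd "
    "WHERE pd.name = 'advantages' AND pd.value LIKE '%forehead laser%'")
print(plan_by_product)
print(plan_by_name)
```

Note that the leading wildcard in '%forehead laser%' prevents an index seek on value; the composite index can only narrow the search to rows with the matching name, and the LIKE is then evaluated on each of those rows.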
I'm just curious here. If I have two tables, let's say Clients and Orders.
Clients have a unique and primary key ID_Client. Orders have an ID_Client field also and a relation to maintain integrity to Client's table by ID_Client field.
So when I want to join both tables I do:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients ON Clients.ID_Client = Orders.ID_Client
So if I took the job to create the primary key, and the relation between the tables,
Is there a reason why I need to explicitly include the joined columns in the ON clause?
Why can't I do something like:
SELECT
Orders.*, Clients.Name
FROM
Orders
INNER JOIN
Clients
So SQL should know which columns relate both tables...
I had this same question once and I found a great explanation for it on Database Administrator Stack Exchange, the answer below was the one that I found to be the best, but you can refer to the link for additional explanations as well.
A foreign key is meant to constrain the data. ie enforce
referential integrity. That's it. Nothing else.
1. You can have multiple foreign keys to the same table. Consider the following where a shipment has a starting point, and an ending point.
table: USA_States
StateID
StateName
table: Shipment
ShipmentID
PickupStateID Foreign key
DeliveryStateID Foreign key
You may want to join based on the pickup state. Maybe you want to join on the delivery state. Maybe you want to perform 2 joins for
both! The sql engine has no way of knowing what you want.
2. You'll often cross join scalar values. Although scalars are usually the result of intermediate calculations, sometimes you'll have a
special purpose table with exactly 1 record. If the engine tried to
detect a foreign key for the join.... it wouldn't make sense because
cross joins never match up a column.
3. In some special cases you'll join on columns where neither is unique. Therefore the presence of a PK/FK on those columns is
impossible.
4. You may think points 2 and 3 above are not relevant since your question is about when there IS a single PK/FK relationship
between tables. However the presence of a single PK/FK between the
tables does not mean you can't have other fields to join on in
addition to the PK/FK. The sql engine would not know which fields you
want to join on.
5. Let's say you have a table "USA_States", and 5 other tables with a FK to the states. The "five" tables also have a few foreign keys to
each other. Should the sql engine automatically join the "five" tables
with "USA_States"? Or should it join the "five" to each other? Both?
You could set up the relationships so that the sql engine enters an
infinite loop trying to join stuff together. In this situation it's
impossible for the sql engine to guess what you want.
In summary: PK/FK has nothing to do with table joins. They are separate unrelated things. It's just an accident of nature that you
often join on the PK/FK columns.
Would you want the sql engine to guess if it's a full, left, right, or
inner join? I don't think so. Although that would arguably be a lesser
sin than guessing the columns to join on.
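To make the shipment example concrete, here is a small sketch in Python's sqlite3 using the table and column names from the quoted answer; the sample rows and the aliases pickup and delivery are invented. The same table is joined twice, once per foreign key, which is something the engine could not infer on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE USA_States (StateID INTEGER PRIMARY KEY, StateName TEXT);
CREATE TABLE Shipment (
    ShipmentID INTEGER PRIMARY KEY,
    PickupStateID INTEGER REFERENCES USA_States(StateID),
    DeliveryStateID INTEGER REFERENCES USA_States(StateID)
);
INSERT INTO USA_States VALUES (1, 'Ohio'), (2, 'Texas');
INSERT INTO Shipment VALUES (500, 1, 2);
""")

# Two foreign keys to the same table: each ON clause says which one a join uses.
rows = conn.execute("""
SELECT s.ShipmentID, pickup.StateName, delivery.StateName
FROM Shipment s
JOIN USA_States pickup   ON pickup.StateID   = s.PickupStateID
JOIN USA_States delivery ON delivery.StateID = s.DeliveryStateID
""").fetchall()
print(rows)
```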
If you don't explicitly give the field names in the query, SQL doesn't know which fields to use. You won't always have fields that are named the same and you won't always be joining on the primary key. For example, a relationship could be between two foreign key fields named "Client_Address" and "Delivery_Address". In that case, you can easily see how you would need to give the field name.
As an example:
SELECT o.*, c.Name
FROM Clients c
INNER JOIN Orders o
ON o.Delivery_Address = c.Client_Address
Is there a reason why I need to explicitly include the joined fields in the ON clause?
Yes, because you still need to tell the database server what you want. "Do what I mean" is not within the capabilities of any software system so far.
Foreign keys are tools for enforcing data integrity. They do not dictate how you can join tables. You can join on any condition that is expressible through an SQL expression.
In other words, a join clause relates two tables to each other by a freely definable condition that needs to evaluate to true given the two rows from left hand side and the right hand side of the join. It does not have to be the foreign key, it can be any condition.
Want to find people that have last names equal to products you sell?
SELECT
Products.Name,
Clients.LastName
FROM
Products
INNER JOIN Clients ON Products.Name = Clients.LastName
There isn't even a foreign key between Products and Clients, still the whole thing works.
It's like that. :)
The SQL standard says that you have to state which columns to join on. The constraints are just for referential integrity. MySQL also supports the syntax "join table using (column1, column2)", but then those columns have to be present in both tables.
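A short sketch comparing the two spellings, run against SQLite (which also accepts USING) rather than MySQL. The Clients and Orders tables follow the question; the ID_Order column and the sample rows are assumptions. USING still names the join column explicitly; it only saves repeating it on both sides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Clients (ID_Client INTEGER PRIMARY KEY, Name TEXT);
-- ID_Order is a hypothetical column added for the example.
CREATE TABLE Orders (
    ID_Order INTEGER PRIMARY KEY,
    ID_Client INTEGER REFERENCES Clients(ID_Client)
);
INSERT INTO Clients VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Explicit ON condition.
on_rows = conn.execute("""
SELECT Orders.ID_Order, Clients.Name
FROM Orders
INNER JOIN Clients ON Clients.ID_Client = Orders.ID_Client
ORDER BY Orders.ID_Order
""").fetchall()

# USING works only because the column has the same name in both tables.
using_rows = conn.execute("""
SELECT Orders.ID_Order, Clients.Name
FROM Orders
INNER JOIN Clients USING (ID_Client)
ORDER BY Orders.ID_Order
""").fetchall()
print(on_rows == using_rows)
```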
Reasons why this behaviour is not default
Because one Table can have multiple columns referencing back to one column in another table.
In a lot of legacy databases there are no Foreign key constraints but yet the columns are “Supposed to be” referencing some column in some other table.
The join conditions are not always as simple as A.Column = B.Column, and the list goes on.
Microsoft developers were intelligent enough to let us make this decision rather than them guessing that it will always be A.Column = B.Column
For a large table of transactions (100 million rows, 20 GB) that already has a primary key (a natural composite key of 4 columns), will it help performance to add an identity column and make that the primary key?
The current primary key (the natural composite primary key of 4 columns) does the job, but I have been told that you should always have a surrogate key. So, could improve performance by creating an identity column and making that the primary key?
I'm using SQL Server 2008 R2 database.
EDIT: This transaction table is mainly joined to definition tables and used to populate reports.
EDIT: If I did add a surrogate key, it wouldn't be used in any joins. The existing key fields would be used.
EDIT: There would be no child tables to this table
Just adding an IDENTITY column and adding a new constraint and index for it is very unlikely to improve performance. The table will be larger and therefore scans and seeks could take longer. There will also be more indexes to update. Of course it all depends what you are measuring the performance of... and whether you intend to make other changes to code or database when you add the new column. Adding an IDENTITY column and doing nothing else would probably be unwise.
Only if:
you have child tables that are larger
you have nonclustered indexes
In each of these cases, the PK (assumed clustered) of your table will be in each child entry/NC entry. So making the clustered key narrower will benefit.
If you have few or no NC indexes (maybe one) and no child tables, all you'll achieve is
a wider row (more data pages used)
a slightly smaller B-tree (which is a fraction of total space)
...but you'll still need an index/constraint on the current 4 columns anyway = an increase in space.
If your 4 way key captures parent table keys too (sounds likely) then you'd lose the advantage of overlap. This would be covered by the new index/constraint though.
So no, you probably don't want to do it.
We threw away a surrogate key (bigint) on a billion+ row table and moved to the actual 11-way key and reduced space on disk by 65%+ because of a simpler structure (one less index, slightly more rows per page etc).
Given your edits, and all the conversation the question has sparked notwithstanding, I would suggest that adding an IDENTITY column to this table will create a lot more harm than benefit.
One place where performance is hurt is when data in the natural key changes. The change would then have to propagate to all the child records. For instance, suppose one of those fields was company name and the company changed its name; then all the related records, and there could be millions of them, would have to change, but if you used a surrogate key, only one record would have to change. Integer joins tend to be faster (generally much faster than 4 column joins) and writing the code to join is generally faster as well. On the other hand, having the vital four fields may mean the join isn't needed as often. Insert performance will take a slight hit as well, as the surrogate key has to be generated and indexed. Usually this is so small a hit as to be unnoticeable, but the possibility is there.
A four column natural key is often not as unique as you think it will be, because with that number of columns the data tends to change over time. While it is unique now, will it be unique over time? If you have used a surrogate key and a unique index on the natural key and it later turns out not to be unique, then all you have to do is drop the unique index. If it is the PK and there are child tables, you have to totally redesign your database.
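The "just drop the unique index" escape hatch can be sketched as follows, again with sqlite3. The table Txn, its columns col1 to col4, and the index name ux_txn_natural are hypothetical; the point is only that uniqueness on the natural key lives in a separate index that can be removed without touching the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Txn (
    id INTEGER PRIMARY KEY,   -- surrogate key, generated by the database
    col1 INTEGER, col2 INTEGER, col3 INTEGER, col4 INTEGER
);
-- The natural key is enforced by a separate unique index.
CREATE UNIQUE INDEX ux_txn_natural ON Txn (col1, col2, col3, col4);
INSERT INTO Txn (col1, col2, col3, col4) VALUES (1, 2, 3, 4);
""")

# While the index exists, a duplicate natural key is rejected.
try:
    conn.execute("INSERT INTO Txn (col1, col2, col3, col4) VALUES (1, 2, 3, 4)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# If the natural key later turns out not to be unique, only the index goes away.
conn.execute("DROP INDEX ux_txn_natural")
conn.execute("INSERT INTO Txn (col1, col2, col3, col4) VALUES (1, 2, 3, 4)")
row_count = conn.execute("SELECT COUNT(*) FROM Txn").fetchone()[0]
print(duplicate_rejected, row_count)
```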
Only you can decide which if any of these considerations affects your specific data needs, surrogate keys are better for some applications and worse for others.
---EDIT:
Based on the edits to the question, adding an identity/surrogate key might not be the solution to this problem.
--Original Answer.
One case of performance improvement would be when you use joins and when you have child tables.
In the absence of surrogate keys, you would have to replicate all 4 key columns to the child table and join on the 4 columns.
t_parent
-------------
col1,
col2,
col3,
col4,
col5,
constraint pk_t_parent primary key (col1,col2,col3,col4)
t_child
----------
col1,
col2,
col3,
col4,
col7,
col8,
constraint pk_t_child primary key (col1,col2,col3,col4, col7),
constraint fk_parent_child foreign key (col1, col2, col3, col4) references
t_parent (col1, col2, col3, col4)
The joins will include all the 4 columns..
select t2.*
from t_parent t1, t_child t2
where (t1.col1 = t2.col1 and
t1.col2 = t2.col2 and
t1.col3 = t2.col3 and
t1.col4 = t2.col4
)
If you use a surrogate key and create a unique constraint on the 4 columns (which currently make up the primary key), the join will be efficient and the data will still be validated as before.
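A sketch of that surrogate-key variant of t_parent and t_child, using sqlite3. The id and parent_id columns and the constraint name uq_t_parent are assumptions added for illustration; the remaining column names follow the tables above. The child carries one narrow column instead of four, and the four-column combination is still validated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_parent (
    id INTEGER PRIMARY KEY,   -- surrogate key (hypothetical column)
    col1 INTEGER, col2 INTEGER, col3 INTEGER, col4 INTEGER,
    col5 TEXT,
    CONSTRAINT uq_t_parent UNIQUE (col1, col2, col3, col4)
);
CREATE TABLE t_child (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL REFERENCES t_parent(id),
    col7 TEXT,
    col8 TEXT
);
INSERT INTO t_parent VALUES (1, 1, 2, 3, 4, 'p');
INSERT INTO t_child VALUES (1, 1, 'a', 'b'), (2, 1, 'c', 'd');
""")

# The join now compares a single integer column instead of four.
child_rows = conn.execute("""
SELECT t2.col7, t2.col8
FROM t_parent t1
JOIN t_child t2 ON t2.parent_id = t1.id
ORDER BY t2.id
""").fetchall()

# The unique constraint still rejects a duplicate natural key.
try:
    conn.execute("INSERT INTO t_parent VALUES (2, 1, 2, 3, 4, 'dup')")
    natural_key_enforced = False
except sqlite3.IntegrityError:
    natural_key_enforced = True
print(child_rows, natural_key_enforced)
```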
I have defined table 1.
Now I want to add a new table, and one of the columns of the second table needs to be the first table.
How can I do it?
If you want one column of a table to reference another table, then your best bet is probably to go read up on the concept of keys, primary keys, and foreign keys in database design.
For example, in a database of companies and employees, you might have 2 tables like this:
Company (c_id, name, city)
Employee (e_id, c_id, name)
In the Company table, c_id would be a primary key. In the Employee table, c_id would be a foreign key referencing Company. This would allow you to do queries like
SELECT E.name
FROM Employee as E, Company as C
WHERE E.c_id = C.c_id AND C.name = 'IBM'
which would return the names of employees who work at IBM.
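For a runnable version of the Company/Employee example, here is a minimal sqlite3 sketch. The sample rows are invented, and the PRAGMA line is specific to SQLite, which only enforces foreign keys when it is switched on. It shows both the query above and what the foreign key buys you: an employee row pointing at a nonexistent company is refused.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.executescript("""
CREATE TABLE Company (c_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE Employee (
    e_id INTEGER PRIMARY KEY,
    c_id INTEGER NOT NULL REFERENCES Company(c_id),
    name TEXT
);
INSERT INTO Company VALUES (1, 'IBM', 'Armonk'), (2, 'Acme', 'Dayton');
INSERT INTO Employee VALUES (1, 1, 'Alice'), (2, 2, 'Bob'), (3, 1, 'Carol');
""")

# Names of employees who work at IBM.
ibm_employees = [row[0] for row in conn.execute(
    "SELECT E.name FROM Employee AS E JOIN Company AS C ON E.c_id = C.c_id "
    "WHERE C.name = 'IBM' ORDER BY E.name")]

# c_id 99 does not exist in Company, so the foreign key rejects this row.
try:
    conn.execute("INSERT INTO Employee VALUES (4, 99, 'Dave')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(ibm_employees, orphan_rejected)
```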
Links:
http://en.wikipedia.org/wiki/Primary_key
http://en.wikipedia.org/wiki/Foreign_key
Why can't you go for a foreign key relationship?
For example: Table1 (ID, ForeignKeyId, other columns)
Table2 (ID, other columns)
ForeignKeyId will hold the primary key of Table2.
If you really need a table as a column, you should read http://msdn.microsoft.com/en-us/library/ms175010.aspx for a solution. However, it is highly unlikely that you really need the table datatype, as it is primarily used for temporary storage.
If you don't know primary-foreign keys relationships stuff, you should take some time learning relational databases or have someone design a database schema for you based on business entities and your application needs. Otherwise you will end up with a design which is completely unmaintainable and mid term it will backfire on you.
If you need a quick read on the PK/FK topic, please see http://www.functionx.com/sqlserver2005/Lesson13.htm. It should give you the knowledge required to tackle this particular issue.
There's a healthy debate out there between surrogate and natural keys:
SO Post 1
SO Post 2
My opinion, which seems to be in line with the majority (it's a slim majority), is that you should use surrogate keys unless a natural key is completely obvious and guaranteed not to change. Then you should enforce uniqueness on the natural key. Which means surrogate keys almost all of the time.
Example of the two approaches, starting with a Company table:
1: Surrogate key: Table has an ID field which is the PK (and an identity). Company names are required to be unique by state, so there's a unique constraint there.
2: Natural key: Table uses CompanyName and State as the PK -- satisfies both the PK and uniqueness.
Let's say that the Company PK is used in 10 other tables. My hypothesis, with no numbers to back it up, is that the surrogate key approach would be much faster here.
The only convincing argument I've seen for natural key is for a many to many table that uses the two foreign keys as a natural key. I think in that case it makes sense. But you can get into trouble if you need to refactor; that's out of scope of this post I think.
Has anyone seen an article that compares performance differences on a set of tables that use surrogate keys vs. the same set of tables using natural keys? Looking around on SO and Google hasn't yielded anything worthwhile, just a lot of theorycrafting.
Important Update: I've started building a set of test tables that answer this question. It looks like this:
PartNatural - parts table that uses the unique PartNumber as a PK
PartSurrogate - parts table that uses an ID (int, identity) as PK and has a unique index on the PartNumber
Plant - ID (int, identity) as PK
Engineer - ID (int, identity) as PK
Every part is joined to a plant and every instance of a part at a plant is joined to an engineer. If anyone has an issue with this testbed, now's the time.
Use both! Natural Keys prevent database corruption (inconsistency might be a better word). When the "right" natural key, (to eliminate duplicate rows) would perform badly because of length, or number of columns involved, for performance purposes, a surrogate key can be added as well to be used as foreign keys in other tables instead of the natural key... But the natural key should remain as an alternate key or unique index to prevent data corruption and enforce database consistency...
Much of the hoohah (in the "debate" on this issue) may be due to a false assumption: that you have to use the Primary Key for joins and Foreign Keys in other tables. THIS IS FALSE. You can use ANY key as the target for foreign keys in other tables. It can be the Primary Key, an alternate Key, or any unique index or unique constraint, as long as it is unique in the target relation (table). And as for joins, you can use anything at all for a join condition; it doesn't even have to be a key, or an index, or even unique! (Although if it is not unique, you will get multiple matching rows in the result.) You can even create a join using non-specific criteria (like >, <, or "like") as the join condition.
Indeed, you can create a join using any valid SQL expression that evaluates to a boolean.
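Both claims can be tried out in a few lines of sqlite3. In this sketch the tables Part, PartAtPlant and WeightClass and their columns are hypothetical: the foreign key targets a unique alternate key (PartNumber) rather than the primary key, and the join uses a range condition instead of equality.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement
conn.executescript("""
CREATE TABLE Part (
    ID INTEGER PRIMARY KEY,
    PartNumber TEXT NOT NULL UNIQUE,   -- alternate (natural) key
    Weight REAL
);
-- The foreign key references the alternate key, not the primary key.
CREATE TABLE PartAtPlant (
    PlantID INTEGER,
    PartNumber TEXT REFERENCES Part(PartNumber)
);
CREATE TABLE WeightClass (Label TEXT, MinWeight REAL, MaxWeight REAL);

INSERT INTO Part VALUES (1, 'P-100', 0.5), (2, 'P-200', 12.0);
INSERT INTO PartAtPlant VALUES (7, 'P-100');
INSERT INTO WeightClass VALUES ('light', 0, 1), ('heavy', 1, 100);
""")

# A join on a range condition: no key, no equality.
classes = conn.execute("""
SELECT p.PartNumber, w.Label
FROM Part p
JOIN WeightClass w ON p.Weight >= w.MinWeight AND p.Weight < w.MaxWeight
ORDER BY p.PartNumber
""").fetchall()

# The foreign key on the alternate key still rejects unknown part numbers.
try:
    conn.execute("INSERT INTO PartAtPlant VALUES (7, 'P-999')")
    unknown_part_rejected = False
except sqlite3.IntegrityError:
    unknown_part_rejected = True
print(classes, unknown_part_rejected)
```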
Natural keys differ from surrogate keys in value, not type.
Any type can be used for a surrogate key, like a VARCHAR for the system-generated slug or something else.
However, the most commonly used types for surrogate keys are INTEGER and RAW(16) (or whatever type your RDBMS uses for GUIDs).
Comparing surrogate integers and natural integers (like SSN) takes exactly the same time.
Comparing VARCHARs may take collation into account, and they are generally longer than integers, making them less efficient.
Comparing a set of two INTEGER is probably also less efficient than comparing a single INTEGER.
On datatypes small in size this difference is probably percents of percents of the time required to fetch pages, traverse indexes, acquire database latches etc.
And here are the numbers (in MySQL):
CREATE TABLE aint (id INT NOT NULL PRIMARY KEY, value VARCHAR(100));
CREATE TABLE adouble (id1 INT NOT NULL, id2 INT NOT NULL, value VARCHAR(100), PRIMARY KEY (id1, id2));
CREATE TABLE bint (id INT NOT NULL PRIMARY KEY, aid INT NOT NULL);
CREATE TABLE bdouble (id INT NOT NULL PRIMARY KEY, aid1 INT NOT NULL, aid2 INT NOT NULL);
INSERT
INTO aint
SELECT id, RPAD('', FLOOR(RAND(20090804) * 100), '*')
FROM t_source;
INSERT
INTO bint
SELECT id, id
FROM aint;
INSERT
INTO adouble
SELECT id, id, value
FROM aint;
INSERT
INTO bdouble
SELECT id, id, id
FROM aint;
SELECT SUM(LENGTH(value))
FROM bint b
JOIN aint a
ON a.id = b.aid;
SELECT SUM(LENGTH(value))
FROM bdouble b
JOIN adouble a
ON (a.id1, a.id2) = (b.aid1, b.aid2);
t_source is just a dummy table with 1,000,000 rows.
aint and adouble, bint and bdouble contain exactly the same data, except that aint has an integer as a PRIMARY KEY, while adouble has a pair of two identical integers.
On my machine, both queries run for 14.5 seconds, +/- 0.1 second.
Performance difference, if any, is within the fluctuations range.