Federated queries in Snowflake? - snowflake-cloud-data-platform

I have three organizations that want to collaborate. All three have the same backend database and tables, and want to run a federated query across these three tables. Is that possible using Snowflake?

If each organization has one "table" and data-shares it with the other two, each of them can then see all three "tables" and run
SELECT a.*, b.*, c.*
FROM mytable AS a
JOIN their_table_one AS b
JOIN the_other_table AS c
just fine.

You can import all tables into Snowflake and then create views that combine these tables so that they are visible as one view.
Example:
CREATE VIEW Table1_v
AS
SELECT col1, col2, col3, 'Source A' AS src
FROM SourceA_Table1
UNION ALL
SELECT col1, col2, col3, 'Source B' AS src
FROM SourceB_Table1
UNION ALL
SELECT col1, col2, col3, 'Source C' AS src
FROM SourceC_Table1;
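To see what the combined view returns, here is a minimal sketch that runs the same CREATE VIEW against an in-memory SQLite database. SQLite only stands in for Snowflake here, and the column types and sample rows are made up for illustration; the table and view names are the ones from the example above.

```python
import sqlite3

SOURCES = ("SourceA_Table1", "SourceB_Table1", "SourceC_Table1")

def build_combined_view(conn):
    # Hypothetical schema and sample rows; only the names come from the answer.
    for i, name in enumerate(SOURCES, start=1):
        conn.execute(f"CREATE TABLE {name} (col1 INTEGER, col2 TEXT, col3 TEXT)")
        conn.execute(f"INSERT INTO {name} VALUES (?, ?, ?)", (i, "v2", "v3"))
    # Same shape as the view in the answer: one branch per source,
    # tagged with a constant so each row remembers where it came from.
    conn.execute("""
        CREATE VIEW Table1_v AS
        SELECT col1, col2, col3, 'Source A' AS src FROM SourceA_Table1
        UNION ALL
        SELECT col1, col2, col3, 'Source B' AS src FROM SourceB_Table1
        UNION ALL
        SELECT col1, col2, col3, 'Source C' AS src FROM SourceC_Table1
    """)

conn = sqlite3.connect(":memory:")
build_combined_view(conn)
rows = conn.execute(
    "SELECT src, col1 FROM Table1_v ORDER BY src"
).fetchall()
print(rows)
```

The src column is what lets a consumer of the view filter or group by originating organization after the tables have been merged.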

Related

Select everything except one column in ARRAY_AGG + STRUCT in BigQuery

I'm currently using ARRAY_AGG + STRUCT to nest all the fields in my table under one column, and it works. The problem is that the solution is not scalable, because every field has to be listed by hand. I'd like to select all the fields except one in my STRUCT, but I don't know how to do this. Here is the sample:
SELECT
Col1,
ARRAY_AGG(STRUCT(Col2,
Col3,
Col4,
Col5,
Col6,
Col7)) OVER (PARTITION BY Col3, Col4, Col5)
FROM
Source
And here is what I'd like to have:
SELECT
Col1,
ARRAY_AGG(STRUCT(* EXCEPT(Col1))) OVER (PARTITION BY Col3, Col4, Col5)
FROM
Source
Below is for BigQuery Standard SQL
#standardsql
select col1,
array_agg((select as struct * except(col1) from unnest([t])))
over(partition by col3, col4, col5)
from source t

SQL query optimization for select query over IN query

I have a view and I want to add pagination logic to it. It contains over 1.5 million records. The query takes a long time when my WHERE condition selects only the records mapped to one Id.
I am thinking of first getting only the mapped records from the main table and then selecting only those records from the view. Will this be faster?
Select top 10 col1, col2, col3, ROW_NUMBER() OVER (ORDER BY col4 desc) from vMyView where someid=1
Then
Select top 10 col1, col2, col3 from vMyView where col1 in (Select col1 from tMyTable where someid=1)
FYI, I am not an expert.
Assuming typical cardinality, I tend to write it more like this:
select top 10 col1, col2, col3
from vMyView v
inner join tMyTable t ON t.col1 = v.col1
WHERE t.someid = 1
However, if it's possible to match more than one row in tMyTable for each col1 value in vMyView, this could possibly result in duplicating rows from vMyView. If duplicating rows is possible, a solution based on row_number() is typically the fastest option.
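The duplication caveat can be seen in a small sketch. It uses an in-memory SQLite database as a stand-in for SQL Server, models vMyView as a plain table, and uses invented sample rows in which tMyTable holds two matching rows for the same col1 value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- vMyView is a plain table here, standing in for the real view.
    CREATE TABLE vMyView (col1 INTEGER, col2 TEXT, col3 TEXT);
    CREATE TABLE tMyTable (col1 INTEGER, someid INTEGER);
    INSERT INTO vMyView VALUES (1, 'a', 'x'), (2, 'b', 'y');
    -- Two mapping rows for col1 = 1: the case the answer warns about.
    INSERT INTO tMyTable VALUES (1, 1), (1, 1), (2, 7);
""")

# JOIN form: one output row per matching pair, so col1 = 1 appears twice.
join_rows = conn.execute("""
    SELECT v.col1, v.col2, v.col3
    FROM vMyView v
    INNER JOIN tMyTable t ON t.col1 = v.col1
    WHERE t.someid = 1
""").fetchall()

# IN form: each view row is tested once, so it appears at most once.
in_rows = conn.execute("""
    SELECT col1, col2, col3
    FROM vMyView
    WHERE col1 IN (SELECT col1 FROM tMyTable WHERE someid = 1)
""").fetchall()

print(len(join_rows), len(in_rows))
```

If the mapping table is guaranteed to hold at most one row per col1 and someid, both forms return the same rows and the choice comes down to the execution plan.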
"I want to add pagination logic on this view"
As for paging, you should look into OFFSET/FETCH syntax, rather than TOP n.
SELECT col1, col2, col3
FROM vMyView v
ORDER BY <need an order by clause for paging to work>
OFFSET <pagenumber * pagesize> ROWS FETCH NEXT <pagesize> ROWS ONLY
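As a sketch of how the placeholders above might be filled from application code: the helper below turns a zero-based page number and a page size into the OFFSET and FETCH values. The function name, the ? placeholder style (as used by ODBC-style drivers), and the choice of col4 DESC for ordering (taken from the question's ROW_NUMBER clause) are assumptions for illustration.

```python
# Query text with positional placeholders; a driver would bind
# (someid, offset, page size) in that order.
PAGED_SQL = (
    "SELECT col1, col2, col3 "
    "FROM vMyView "
    "WHERE someid = ? "
    "ORDER BY col4 DESC "
    "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY"
)

def page_params(some_id, page_number, page_size):
    """Return the bind values for PAGED_SQL; page_number is zero-based."""
    if page_number < 0 or page_size <= 0:
        raise ValueError("page_number must be >= 0 and page_size > 0")
    offset = page_number * page_size  # rows to skip before this page
    return (some_id, offset, page_size)

print(page_params(1, 0, 10))
print(page_params(1, 3, 25))
```

Paging is only stable if the ORDER BY is deterministic, so when col4 can contain ties it may help to add a unique column as a tiebreaker.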

Is it possible to create SQL query templates?

I have a couple of tables as data sources which have extremely similar structure. I only care about some columns of them and I want to join them. So what I do at the moment is:
SELECT 'table_a' AS source, col1, col2, col3, col4
FROM table_a as source_table
INNER JOIN other on source_table.id = other.id
UNION ALL
SELECT 'table_b' AS source, col1, col2, col3, col4
FROM table_b as source_table
INNER JOIN other on source_table.id = other.id
UNION ALL
SELECT 'table_c' AS source, col1, col2, col3, col4
FROM table_c as source_table
INNER JOIN other on source_table.id = other.id
UNION ALL
SELECT 'table_d' AS source, col1, col2, col3, col4
FROM table_d as source_table
INNER JOIN other on source_table.id = other.id
I would like to do something like this:
query(param1, param2) := {
SELECT param1 AS source, col1, col2, col3, col4
FROM param2 as source_table
INNER JOIN other on source_table.id = other.id
}
query('table_a', table_a)
UNION ALL
query('table_b', table_b)
UNION ALL
query('table_c', table_c)
UNION ALL
query('table_d', table_d)
I know how to do this within the programming language (using a templating engine and constructing the query string).
Is something like this possible within SQL (Snowflake Warehouse)?
You can't do exactly that, I'm afraid. However, you can use Snowflake Stored Procedures (SPs) to effectively achieve this. You can construct the SQL query text in the SP based on the parameters passed to it, and then execute it. You can, for example, pass it an array of table names.
One problem is that today SPs in Snowflake do not return the result of the query directly. To overcome this, you can e.g. save the result of the query in a new table (with the name hardcoded in SP, or passed as a parameter to SP) and then query it with a separate SELECT.
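The query text such a procedure would build can be sketched outside Snowflake. The Python below is not a stored procedure; it only illustrates the string construction, using the column and table names from the question. The helper names and the identifier check are assumptions added for illustration.

```python
import re

# Accept only plain identifiers, since table names are spliced into SQL text.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def branch(table):
    """One SELECT branch of the template, for a single source table."""
    if not _IDENT.fullmatch(table):
        raise ValueError(f"not a plain identifier: {table!r}")
    return (
        f"SELECT '{table}' AS source, col1, col2, col3, col4\n"
        f"FROM {table} AS source_table\n"
        f"INNER JOIN other ON source_table.id = other.id"
    )

def union_query(tables):
    """Join one branch per table with UNION ALL."""
    if not tables:
        raise ValueError("need at least one table")
    return "\nUNION ALL\n".join(branch(t) for t in tables)

sql = union_query(["table_a", "table_b", "table_c", "table_d"])
print(sql)
```

Inside a stored procedure the same loop would run over the array of table names passed in, and the resulting text would then be executed and its result stored as described above.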

mssql checksum on different tables

I need to find if two rows (one having the same id of the other +50000) are the same. Is there any way to make this query work?
select 1
from table1 c1,
table2 c2
where c1.id=c2.id+50000 and CHECKSUM(c1.*) = CHECKSUM(c2.*)
CHECKSUM() apparently does not accept "table.*" expressions. It accepts either "*" alone or list of columns, but I can't do that as this query needs to work also for other tables with other columns.
EDIT: I just realized that CHECKSUM() will not work as the value will always be different if the IDs are different....
The original question still holds out of curiosity.
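Try something like this; it will work for most data types (not TEXT and some others). The EXCEPT returns no row when the two column lists are equal, so NOT EXISTS keeps the pairs that are the same:
SELECT 1
FROM
table1 c1
JOIN
table2 c2
ON
c1.id=c2.id+50000 and
NOT EXISTS(SELECT c1.col1, c1.col2, c1.col3, c1.col4 EXCEPT SELECT c2.col1, c2.col2, c2.col3, c2.col4)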
You can do it using derived tables:
SELECT
SUM(CASE
WHEN a.cs <> b.cs THEN 1
ELSE 0
END)
FROM (SELECT RowNumber, CHECKSUM(*) AS cs FROM #A) a
FULL OUTER JOIN (SELECT RowNumber, CHECKSUM(*) AS cs FROM #B) b
ON a.RowNumber=b.RowNumber;
This is an excerpt from a script I've written previously. I have not changed any of the object names to match your example. The result of this query is the number of differences between #A and #B where the RowNumber columns match.
To apply this to your case, you can create two temporary tables, populating them from the originals, but replacing the ID column with a "RowNumber" column that matches between the rows you want to compare (i.e. c1.id=c2.id+50000). That way, you don't have mismatched IDs interfering with the CHECKSUM.

What's the most efficient syntax to use Merge to upsert many rows at once?

There are two ways I've found of upserting many rows into a table with SQL Server 2008.
One of them, found at http://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx, says to create a temp table, insert the values into the temp table, and finally merge that table with the target table.
This doesn't seem very efficient to me, because you have to create a table, fill it, merge it into the target table, and then delete the temp table.
The only other thing I can think of is as follows...
MERGE dbo.targettable as tgt
USING (
SELECT 12 as col1, 13 as col2, 'abc' as col3, 'zyx' as col4
UNION ALL
SELECT 11 as col1, 11 as col2, 'def' as col3, 'def' as col4
(etc etc)
UNION ALL
SELECT 7 as col1, 10 as col2, 'jfj' as col3, 'tub' as col4)
as new
ON tgt.col1=new.col1
WHEN MATCHED THEN UPDATE SET tgt.col2=new.col2, tgt.col3=new.col3, tgt.col4=new.col4
WHEN NOT MATCHED THEN INSERT (col1, col2, col3, col4)
VALUES(new.col1, new.col2, new.col3, new.col4);
Based on usr's answer I was able to find http://msdn.microsoft.com/en-us/library/bb510625.aspx
I think this is the way to do it. Could someone verify that this syntax appears correct?
MERGE dbo.targettable as tgt
USING (VALUES(12, 13, 'abc', 'zyx'), (11, 11, 'def', 'def'),(7, 10, 'jfj', 'tub'))
AS new (col1, col2, col3, col4)
ON tgt.col1=new.col1
WHEN MATCHED THEN UPDATE SET tgt.col2=new.col2, tgt.col3=new.col3, tgt.col4=new.col4
WHEN NOT MATCHED THEN INSERT (col1, col2, col3, col4)
VALUES(new.col1, new.col2, new.col3, new.col4);
Where does the data to be merged come from?
If it comes from a query, inline the query into the merge.
If it comes from the app, use table-valued parameters.
If it is generated iteratively, use a temp table or table variable.
If it is a constant like in your example, use the VALUES clause. Don't use UNION ALL because it is more verbose, does not document semantics nicely and increases query compile time because the optimizer has to convert it to VALUES form.
