I am trying to find column-level lineage information within Snowflake.
A few blogs say that lineage can be built from the data in the ACCESS_HISTORY view under the ACCOUNT_USAGE schema, but I could not find the relevant info in this view. Can anybody clarify this?
Column Lineage
Column lineage (i.e. access history for columns) extends the Account Usage ACCESS_HISTORY view to specify how data flows from the source column to the target column in a write operation. Snowflake tracks the data from the source columns through all subsequent table objects that reference data from the source columns (e.g. INSERT, MERGE, CTAS) provided that objects in the lineage chain are not dropped. Snowflake makes column lineage accessible by enhancing the OBJECTS_MODIFIED column in the ACCESS_HISTORY view.
Example: Column Lineage
The following example queries the ACCESS_HISTORY view and uses the FLATTEN function to flatten the OBJECTS_MODIFIED column.
...
select
    directSources.value: "objectId" as source_object_id,
    directSources.value: "objectName" as source_object_name,
    directSources.value: "columnName" as source_column_name,
    'DIRECT' as source_column_type,
    om.value: "objectName" as target_object_name,
    columns_modified.value: "columnName" as target_column_name
from
    (
        select *
        from snowflake.account_usage.access_history
    ) t,
    lateral flatten(input => t.OBJECTS_MODIFIED) om,
    lateral flatten(input => om.value: "columns", outer => true) columns_modified,
    lateral flatten(
        input => columns_modified.value: "directSources",
        outer => true
    ) directSources
union
select
    baseSources.value: "objectId" as source_object_id,
    baseSources.value: "objectName" as source_object_name,
    baseSources.value: "columnName" as source_column_name,
    'BASE' as source_column_type,
    om.value: "objectName" as target_object_name,
    columns_modified.value: "columnName" as target_column_name
from
    (
        select *
        from snowflake.account_usage.access_history
    ) t,
    lateral flatten(input => t.OBJECTS_MODIFIED) om,
    lateral flatten(input => om.value: "columns", outer => true) columns_modified,
    lateral flatten(
        input => columns_modified.value: "baseSourceColumns",
        outer => true
    ) baseSources
;
Try the OBJECT_DEPENDENCIES view, documented here: https://docs.snowflake.com/en/sql-reference/account-usage/object_dependencies.html
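Note that OBJECT_DEPENDENCIES gives object-level dependencies (e.g. a view built on a table), not column-level lineage. If object-level lineage is enough for your use case, a minimal sketch against that view might look like this (column names are as documented for the view; verify them in your account):

select referencing_database, referencing_schema, referencing_object_name,
       referenced_database, referenced_schema, referenced_object_name,
       dependency_type
from snowflake.account_usage.object_dependencies
order by referencing_object_name;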
Related
Hello, I need to create a new table from another table that has a nested column (e.g. a metrics column) that I need to unnest for the new table, without writing out each column (because... what if I have 100 columns?):
INSERT INTO new_table
SELECT sales_uk, sales_ca, sales_sp, sales_us, sales, metrics[0]:category::string FROM og_table
Is there another way? I've tried this, but it didn't work:
select io.metrics[0]:category::string as new_id, io.*
from og_table io
If I understand your question correctly (you have a variant column that you want to structure into a column for every object key) and copy/paste is acceptable, then an approach like the following should work: use LATERAL FLATTEN on the variant column to produce a list of copy/paste-able column expressions for your new table definition.
-- Build one sample row shaped like the metrics column
with og as (select array_construct(object_construct(
    'category', 1,
    'bar', 2,
    'baz', 3
)) metrics)
-- Emit one copy/paste-able column expression per object key
select 'metrics'||metrics_array.path||':'||metrics_object.path||'::string as '||metrics_object.path||',' newcol from og
    ,lateral flatten(metrics) metrics_array
    ,lateral flatten(metrics_array.value) metrics_object;
--Produces output like:
metrics[0]:bar::string as bar,
metrics[0]:baz::string as baz,
metrics[0]:category::string as category,
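From there you can paste the generated lines into a CREATE TABLE ... AS SELECT. A minimal sketch, assuming your source table is og_table and keeping the question's original columns alongside the unnested ones (drop the trailing comma on the last generated line):

create table new_table as
select
    sales_uk, sales_ca, sales_sp, sales_us, sales,
    metrics[0]:bar::string as bar,
    metrics[0]:baz::string as baz,
    metrics[0]:category::string as category
from og_table;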
I have a products table with two attribute columns and a JSON column. I'd like to be able to break up the JSON column and insert extra rows while retaining the attributes. Sample data looks like:
ID Name Attributes
1 Nikon {"4e7a":["jpg","bmp","nef"],"604e":["en"]}
2 Canon {"4e7a":["jpg","bmp"],"604e":["en","jp","de"]}
3 Olympus {"902c":["yes"], "4e7a":["jpg","bmp"]}
I understand OPENJSON can convert JSON objects into rows and key values into cells, but how do I apply it to a single column that contains JSON data?
My goal is to have an output like:
ID Name key value
1 Nikon 902c NULL
1 Nikon 4e7a ["jpg","bmp","nef"]
1 Nikon 604e ["en"]
2 Canon 902c NULL
2 Canon 4e7a ["jpg","bmp"]
2 Canon 604e ["en","jp","de"]
3 Olympus 902c ["yes"]
3 Olympus 4e7a ["jpg","bmp"]
3 Olympus 604e NULL
Is there a way I can query this products table like the following? Or is there another way to produce my goal data set?
SELECT
ID,
Name,
OPENJSON(Attributes)
FROM products
Thanks!
Here is something that will at least start you in the right direction.
SELECT P.ID, P.[Name], AttsData.[key], AttsData.[Value]
FROM products P CROSS APPLY OPENJSON (P.Attributes) AS AttsData
The one thing that has me stuck a bit right now is the missing values (where value is NULL in your desired result)...
I was thinking of maybe doing some sort of outer/full join back to this, but even that is giving me headaches. Are you certain you need that? Or could you do an existence check with the output from the SQL above?
I am going to keep at this. If I find a solution that matches your output exactly, I will add to this answer.
Until then... good luck!
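For the existence-check idea mentioned above, here is a minimal sketch (the key '902c' and the column alias Has902c are just examples based on your sample data):

SELECT P.ID, P.[Name],
       CASE WHEN EXISTS ( SELECT 1
                          FROM OPENJSON( P.Attributes ) AS A
                          WHERE A.[key] = '902c' )
            THEN 1 ELSE 0
       END AS Has902c
FROM products P;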
You can get the rows with NULL value fields by creating a list of possible keys and using CROSS APPLY to associate each key to each row from the original dataset, and then left-joining in the parsed JSON.
Here's a working example you should be able to execute as-is:
-- Throw together a quick and dirty CTE containing your example data
WITH OriginalValues AS (
SELECT *
FROM (
VALUES ( 1, 'Nikon', '{"4e7a":["jpg","bmp","nef"],"604e":["en"]}' ),
( 2, 'Canon', '{"4e7a":["jpg","bmp"],"604e":["en","jp","de"]}' ),
( 3, 'Olympus', '{"902c":["yes"], "4e7a":["jpg","bmp"]}' )
) AS T ( ID, Name, Attributes )
),
-- Build a separate dataset that includes all possible 'key' values from the JSON.
PossibleKeys AS (
SELECT DISTINCT A.[key]
FROM OriginalValues CROSS APPLY OPENJSON( OriginalValues.Attributes ) AS A
),
-- Get the existing keys and values from the JSON, associated with the record ID
ValuesWithKeys AS (
SELECT OriginalValues.ID, Atts.[key], Atts.Value
FROM OriginalValues CROSS APPLY OPENJSON( OriginalValues.Attributes ) AS Atts
)
-- Join each possible 'key' value with every record in the original dataset, and
-- then left join the parsed JSON values for each ID and key
SELECT OriginalValues.ID, OriginalValues.Name, KeyList.[key], ValuesWithKeys.Value
FROM OriginalValues
CROSS APPLY PossibleKeys AS KeyList
LEFT JOIN ValuesWithKeys
ON OriginalValues.ID = ValuesWithKeys.ID
AND KeyList.[key] = ValuesWithKeys.[key]
ORDER BY ID, [key];
If you need to include some pre-determined key values where some of them might not exist in ANY of the JSON values stored in Attributes, you could construct a CTE (like I did to emulate your original dataset) or a temp table to provide those values instead of doing the DISTINCT selection in the PossibleKeys CTE above. If you already know what your possible key values are without having to query them out of the JSON, that would most likely be a less costly approach.
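For example, the PossibleKeys CTE above could be swapped for a hard-coded list like this sketch (the three keys are taken from your sample data; extend the VALUES list as needed):

PossibleKeys AS (
    SELECT [key]
    FROM ( VALUES ( '4e7a' ), ( '604e' ), ( '902c' ) ) AS K ( [key] )
),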
In the database on which I am attempting to create a full-text search, I need to construct a table whose column names come from one column in a previous table. In my current implementation attempt, the full-text indexing is applied to the first table, Data, and the search for the phrase is done there; then the second table with the search results is made.
The schema for the database is
**Players**
Id
PlayerName
Blacklisted
...
**Details**
Id
Name -> FirstName, LastName, Team, Substitute, ...
...
**Data**
Id
DetailId
PlayerId
Content
DetailId in the table Data relates to Id in Details, and PlayerId relates to Id in Players. If there are 1k rows in Players and 20 rows in Details, then there are 20k rows in Data.
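For reference, a minimal sketch of that schema (column types and sizes are my assumptions; the real tables have more columns):

CREATE TABLE dbo.Players ( Id INT PRIMARY KEY, PlayerName NVARCHAR(100), Blacklisted BIT );
CREATE TABLE dbo.Details ( Id INT PRIMARY KEY, Name NVARCHAR(50) );
CREATE TABLE dbo.Data
(
    Id INT PRIMARY KEY,
    DetailId INT REFERENCES dbo.Details(Id),
    PlayerId INT REFERENCES dbo.Players(Id),
    Content NVARCHAR(MAX)
);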
WITH RankedPlayers AS
(
    SELECT c.PlayerID, SUM(KT.[RANK]) AS Rnk
    FROM Data c
    INNER JOIN FREETEXTTABLE(dbo.Data, Content, '"Some phrase like team name and player name"')
        AS KT ON c.DataID = KT.[KEY]
    GROUP BY c.PlayerID
)
…
Then a table is made by selecting the rows of one column, similar to a pivot.
…
SELECT rc.Rnk,
c.PlayerID,
PlayerName,
TeamID,
…
(SELECT Content FROM dbo.Data data WHERE DetailID = 1 AND data.PlayerID = c.PlayerID) AS [TeamName],
…
FROM dbo.Players c
JOIN RankedPlayers rc ON c.PlayerID = rc.PlayerID
ORDER BY rc.Rnk DESC
I can return a ranked table with this implementation; the aim, however, is to produce results from weighted columns, so that, say, the column PlayerName contributes more to the rank than TeamName.
I have tried making a schema-bound view with a pivot, but then I cannot index it because of the pivot. I have tried making a view of that view, but it seems the metadata is inherited; plus, that feels like a clunky method.
I then tried to do it as a straight query using subqueries in the SELECT statement, but could not, because indexed views do not allow subqueries.
I then tried joining multiple times, but again, an indexed view does not allow self-referencing joins.
How to do this?
I have come across this article http://developmentnow.com/2006/08/07/weighted-columns-in-sql-server-2005-full-text-search/ , and other articles here on weighted columns, however nothing as far as I can find addresses weighting columns when the columns were initially row data.
A simple solution that works really well: put weights on the rows containing the required IDs in another table, left join that table to the table to which the full-text search is applied, and multiply the rank by the weight. Continue as previously implemented.
In code, that comes out as:
DECLARE @Weight TABLE
(
    DetailID INT,
    [Weight] FLOAT
);

INSERT INTO @Weight VALUES
(1, 0.80),
(2, 0.80),
(3, 0.50);

WITH RankedPlayers AS
(
    SELECT c.PlayerID, SUM(KT.[RANK] * ISNULL(cw.[Weight], 0.10)) AS Rnk
    FROM Data c
    INNER JOIN FREETEXTTABLE(dbo.Data, Content, 'Karl Kognition C404') AS KT ON c.DataID = KT.[KEY]
    LEFT JOIN @Weight cw ON c.DetailID = cw.DetailID
    GROUP BY c.PlayerID
)
SELECT rc.Rnk,
...
I'm using a table variable here as a proof of concept. I am considering adding a Weight column to the Details table to avoid an unnecessary table and left join.
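A hedged sketch of that alternative, assuming a new Weight column on Details with a 0.10 default to match the ISNULL fallback above (the column name and default are my assumptions):

-- Assumed new column; the default mirrors the ISNULL fallback
ALTER TABLE dbo.Details ADD [Weight] FLOAT NOT NULL DEFAULT 0.10;

-- The CTE then joins Details directly instead of the table variable
WITH RankedPlayers AS
(
    SELECT c.PlayerID, SUM(KT.[RANK] * d.[Weight]) AS Rnk
    FROM Data c
    INNER JOIN FREETEXTTABLE(dbo.Data, Content, 'Karl Kognition C404') AS KT ON c.DataID = KT.[KEY]
    INNER JOIN dbo.Details d ON c.DetailID = d.Id
    GROUP BY c.PlayerID
)
SELECT PlayerID, Rnk
FROM RankedPlayers
ORDER BY Rnk DESC;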
I have a products table and a metadata table that I want to allow searching of. Having managed to write the query so it works super fast, I'm having trouble migrating it to CakePHP 3.x
The tables are standard parent->child setup with foreign keys and fulltext indexes on the relevant data fields.
The query I want to emulate is:
select products.*, sum(hits.relevance) as relevance from (
SELECT products.id, MATCH(products.code, products.title) AGAINST('"mm mmm"' IN BOOLEAN MODE) as relevance
FROM products
WHERE MATCH(products.code, products.title) AGAINST('"mm mmm"' IN BOOLEAN MODE)
union all
SELECT pim1.product_id as id, MATCH(pim1.value) AGAINST('"mm mmm"' IN BOOLEAN MODE) as relevance
FROM pim1
WHERE MATCH(pim1.value) AGAINST('"mm mmm"' IN BOOLEAN MODE)
) as hits
left join products on products.id = hits.id
group by products.id
order by relevance desc
Essentially this lets MySQL use the indexes much faster than a left join does, unions the results, then uses that as the primary table to left join the products data onto, before handing off to paginate() and the view.
I have the unions all sorted, but I can't seem to get the outer query to work.
$pim = $this->Products
->association("Pim1")
->find('all')
->select(['fk' => 'product_id'])
->select(['relevance' => 'MATCH(value) AGAINST(:search IN BOOLEAN MODE)'])
->where("MATCH(value) AGAINST(:search IN BOOLEAN MODE)")
->bind(":search", $this->request->session()->read('search.wild_terms'));
$prd = $this->Products
->find('all')
->select(['fk' => 'id'])
->select(['relevance' => 'MATCH(code, title) AGAINST(:search IN BOOLEAN MODE)'])
->where("MATCH(code, title) AGAINST(:search IN BOOLEAN MODE)")
->bind(":search", $this->request->session()->read('search.wild_terms'));
This bit works and runs the union as per the subquery above
$query = $prd->unionAll($pim);
This bit doesn't then allow me to attach the products data to the results of that union
$query->leftJoinWith("Products", function ($q) { return $q->where(['Products.id' => 'fk']); });
It throws an error
Products is not associated with Products
Any guidance on how to convert my successful SQL into Cake would be greatly appreciated.
leftJoinWith() is used for joining associations. Since your Products table is not associated with itself, you cannot use it here. Instead, use leftJoin(). You will need to pass all the information to that method to build the join conditions.
I have a view created as follows:
CREATE VIEW [dbo].[vwNumberOfEditsForTimeSheets]
AS
SELECT TOP (100) PERCENT TimeSheetId, COUNT(TimeSheetId) AS NumberOfEdits
FROM dbo.TimeSheetLogs AS tsl
WHERE (StatusId = 27)
GROUP BY TimeSheetId
ORDER BY TimeSheetId
It probably has about 100,000 entries now and would increase by around 500 to 1,000 each day.
Which type of indexing would be best for this type of view?
Thanks
You could in fact use an indexed view here
CREATE VIEW [dbo].[vwNumberOfEditsForTimeSheets]
WITH SCHEMABINDING
AS
SELECT TimeSheetId,
COUNT_BIG(*) AS NumberOfEdits
FROM dbo.TimeSheetLogs AS tsl
WHERE ( StatusId = 27 )
GROUP BY TimeSheetId
GO
CREATE UNIQUE CLUSTERED INDEX IX
ON [dbo].[vwNumberOfEditsForTimeSheets](TimeSheetId)
Depending on your edition, you might need to use the NOEXPAND hint to get it used.
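For example (the TimeSheetId value is just a placeholder):

SELECT TimeSheetId, NumberOfEdits
FROM dbo.vwNumberOfEditsForTimeSheets WITH (NOEXPAND)
WHERE TimeSheetId = 42;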
If you don't want to use an indexed view then the optimum index on the base table to support that SELECT query would be
CREATE NONCLUSTERED INDEX IX ON dbo.TimeSheetLogs(StatusId, TimeSheetId)
to allow the lookup by StatusId with matching rows ordered by TimeSheetId, and thus easily grouped and counted by a stream aggregate.
Note that indexed views make some of your queries faster, but they require more maintenance when you are updating (insert, update, delete, merge) the base table. You really should check if this view is worth what you pay for it by performing tests against your full workload, not just the selects against the view.
CREATE VIEW [dbo].[vwNumberOfEditsForTimeSheets]
WITH SCHEMABINDING
AS
SELECT TimeSheetId, COUNT_BIG(*) AS NumberOfEdits
FROM dbo.TimeSheetLogs AS tsl
WHERE (StatusId = 27)
GROUP BY TimeSheetId;
GO
CREATE UNIQUE CLUSTERED INDEX TSId
ON dbo.vwNumberOfEditsForTimeSheets(TimeSheetId);
GO
If you are on standard edition, you may have to use the WITH (NOEXPAND) hint to make use of the view.