Conditional Insert Based on Fuzzy Logic in Snowflake

I have a table as below:

Company Name  ID
Facebook      32
Google        33
Apple         44
So if I get a new record with company name "Facebook Inc" or "Facebook Company", it should be ignored; otherwise it should be inserted. What should the condition logic be?
insert into Table a where ? (fuzzy logic)

For the logic described in the question, a simple way of solving this is with a merge:
Apply the "fuzzy logic" to look for a match. In this case it's a regex that compares the first word of each string: regexp_substr(a.company, '^[^ ]+') = regexp_substr(b.company, '^[^ ]+')
If matched, do nothing (the "and false" predicate guarantees the update branch never fires).
If not matched, insert:
merge into companies a
using (
select 'Facebook Inc' company, 10 id
) as b on regexp_substr(a.company, '^[^ ]+') = regexp_substr(b.company, '^[^ ]+')
when matched and false then update set a.id = b.id
when not matched then insert (company, id) values (b.company, b.id)
Setup:
create or replace temp table companies as
select $1::string company, $2::int id
from values ('Google', 1), ('Facebook', 2), ('Apple', 3);
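As a quick sanity check (assuming the three-row setup above), running the merge and then selecting from the table should show that 'Facebook Inc' was not inserted, because its first word matched the existing 'Facebook' row:

select * from companies order by id;
-- Google   | 1
-- Facebook | 2
-- Apple    | 3   (no row was added for 'Facebook Inc')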
If you want to define a more complex "fuzzy logic", please start a new question.

Related

Snowflake - how to do multiple DML operations on same primary key in a specific order?

I am trying to set up continuous data replication in Snowflake. I get the transactions that happened in the source system, and I need to perform them in Snowflake in the same order as in the source system. I am trying to use MERGE for this, but when there are multiple operations on the same key in the source system, MERGE is not working correctly. It either misses an operation or returns a "duplicate row detected during DML operation" error.
Please note that the transactions need to happen in the exact order; it is not possible to take only the latest transaction for a key and apply just that (e.g., if a record has been INSERTED and UPDATED, in Snowflake too it needs to be inserted first and then updated, even though the insert is only a transient state).
Here is the example:
create or replace table employee_source (
id int,
first_name varchar(255),
last_name varchar(255),
operation_name varchar(255),
binlogkey integer
);
create or replace table employee_destination ( id int, first_name varchar(255), last_name varchar(255) );
insert into employee_source values (1,'Wayne','Bells','INSERT',11);
insert into employee_source values (1,'Wayne','BellsT','UPDATE',12);
insert into employee_source values (2,'Anthony','Allen','INSERT',13);
insert into employee_source values (3,'Eric','Henderson','INSERT',14);
insert into employee_source values (4,'Jimmy','Smith','INSERT',15);
insert into employee_source values (1,'Wayne','Bellsa','UPDATE',16);
insert into employee_source values (1,'Wayner','Bellsat','UPDATE',17);
insert into employee_source values (2,'Anthony','Allen','DELETE',18);
MERGE INTO employee_destination AS T
USING (SELECT * FROM employee_source ORDER BY binlogkey) AS S
ON T.id = S.id
WHEN NOT MATCHED AND S.operation_name = 'INSERT' THEN
    INSERT (id, first_name, last_name)
    VALUES (S.id, S.first_name, S.last_name)
WHEN MATCHED AND S.operation_name = 'UPDATE' THEN
    UPDATE SET T.first_name = S.first_name, T.last_name = S.last_name
WHEN MATCHED AND S.operation_name = 'DELETE' THEN
    DELETE;
I am expecting to see 'Bellsat' as the last name for employee id 1 in the employee_destination table after all rows are processed. Likewise, I should not see emp id 2 in the employee_destination table.
Is there any alternative to MERGE to achieve this? Basically, something that applies every single DML in the same order (using the binlogkey column for ordering).
Thanks.
You need to manipulate your source data to ensure that you only have one record per key/operation; otherwise the join will be non-deterministic and will (depending on your settings) either error or update using a random one of the applicable source records. This is covered in the documentation here: https://docs.snowflake.com/en/sql-reference/sql/merge.html#duplicate-join-behavior
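For reference, the setting mentioned is Snowflake's ERROR_ON_NONDETERMINISTIC_MERGE session parameter. It defaults to true (error on a nondeterministic merge); disabling it makes Snowflake apply an arbitrary matching source row instead:

-- Not recommended; shown only to illustrate the behavior described above.
alter session set ERROR_ON_NONDETERMINISTIC_MERGE = false;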
In any case, why would you want to update a record only for it to be overwritten by another update? That would be incredibly inefficient.
Since your updates appear to include the new values for all columns, you can use a window function to keep just the latest incoming change per key, and then merge those results into the target table. For example, the select for that merge (with the window function to get only the latest change) would look like this:
with SOURCE_DATA as
(
select COLUMN1::int ID
,COLUMN2::string FIRST_NAME
,COLUMN3::string LAST_NAME
,COLUMN4::string OPERATION_NAME
,COLUMN5::int PROCESSING_ORDER
from values
(1,'Wayne','Bells','INSERT',11),
(1,'Wayne','BellsT','UPDATE',12),
(2,'Anthony','Allen','INSERT',13),
(3,'Eric','Henderson','INSERT',14),
(4,'Jimmy','Smith','INSERT',15),
(1,'Wayne','Bellsa','UPDATE',16),
(1,'Wayne','Bellsat','UPDATE',17),
(2,'Anthony','Allen','DELETE',18)
)
select * from SOURCE_DATA
qualify row_number() over (partition by ID order by PROCESSING_ORDER desc) = 1
That will produce a result set that has only the changes required to merge into the target table:

ID  FIRST_NAME  LAST_NAME  OPERATION_NAME  PROCESSING_ORDER
1   Wayne       Bellsat    UPDATE          17
2   Anthony     Allen      DELETE          18
3   Eric        Henderson  INSERT          14
4   Jimmy       Smith      INSERT          15
You can then change the when not matched clause to drop the operation_name condition: if a key arrives as an UPDATE but is not yet in the target table, that is because it was inserted by an earlier operation within the new changes.
For the when matched clause, you can use operation_name to determine whether the row should be updated or deleted; a sketch of the full statement follows.
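A minimal sketch of that adjusted merge, using the source table from the question (the guard that stops a DELETE-only key from being inserted is my assumption, not part of the original answer):

merge into employee_destination t
using (
    -- keep only the latest change per key
    select *
    from employee_source
    qualify row_number() over (partition by id order by binlogkey desc) = 1
) s
on t.id = s.id
when matched and s.operation_name = 'DELETE' then
    delete
when matched then
    update set t.first_name = s.first_name, t.last_name = s.last_name
when not matched and s.operation_name <> 'DELETE' then
    insert (id, first_name, last_name)
    values (s.id, s.first_name, s.last_name);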

How to do Multiple Joins with Merge Into?

I have these 3 tables:

Company
    id
Branch
    id
Items
    id
    StockNumber
Company can have many branches and a branch can have many items.
Now I have to write a query that will either insert or update an item, depending on conditions.
Some items can appear only once in the company, and some items can appear in each branch.
The problem for me is the ones that can appear only once in the company. I think I am going to need to join all these tables together and do a check, but I don't know how to do this join in a MERGE INTO stored procedure.
I made a table type that looks like this
CREATE TYPE ItemTableType AS TABLE
(
    BranchId INT,
    CompanyId INT,
    Description NVARCHAR(MAX),
    StockNumber INT
);
In my code I can pass the CompanyId in through my table type:
CREATE PROCEDURE dbo.usp_Upsert @Source ItemTableType READONLY
AS
MERGE INTO Items AS Target
USING @Source AS Source
ON
-- need to somehow look at the CompanyId so I can find the right record regardless of which branch it sits in.
-- Target.CompanyId = Source.CompanyId  -- can't do it just like this, as Items does not have a reference to the Company table.
Target.StockNumber = Source.StockNumber
WHEN MATCHED THEN
-- update
WHEN NOT MATCHED BY TARGET THEN
-- insert
Edit
Sample Data
Company
Id Name
1 'A'
2 'B'
Branch
Id name CompanyId
1 'A.1' 1
2 'A.2' 1
3 'B.1' 2
4 'B.2' 3
Item
Id Name StockNumber BranchId
1 Wrench 12345 1
2 Wrench 12345 3
3 Hammer 484814 2
4 Hammer 85285825 4
Now bulk data is going to be sent into this SP via C# code that looks something like this:
DataTable myTable = ...;
// Name of the stored procedure to call.
string sqlInsert = "dbo.usp_InsertTvp";
// Configure the command and parameter.
SqlCommand mergeCommand = new SqlCommand(sqlInsert, connection);
mergeCommand.CommandType = CommandType.StoredProcedure;
SqlParameter tvpParam = mergeCommand.Parameters.AddWithValue("@Source", myTable);
tvpParam.SqlDbType = SqlDbType.Structured;
tvpParam.TypeName = "dbo.SourceTableType";
// Execute the command.
mergeCommand.ExecuteNonQuery();
Now say an import of records comes in and the data looks like this:
Wrench (Name), 12345 (StockNumber), 2 (BranchId: they are switching this item to another branch)
If I just sent this in and matched on BranchId + StockNumber, nothing would be updated and a new record would be inserted, which would be wrong, as two branches would now have the same item (based on StockNumber).
If I matched on StockNumber alone, then these 2 records would be updated:
1 Wrench 12345 1
2 Wrench 12345 3
Which is wrong, as these records are from 2 different companies. Thus I also need to check the CompanyId.
EDIT (from comments):
I think I have to do Target Dot something. This is what I came up with so far:
MERGE INTO Items AS Target
USING @Source AS Source
ON Source.CompanyId = (
    SELECT TOP 1 Companies.Id
    FROM Branches
    INNER JOIN Companies
        ON Branches.CompanyId = Companies.Id
    INNER JOIN InventoryItems
        ON Branches.Id = Target.BranchId
    WHERE Companies.Id = Source.CompanyId
      AND StockNumber = Source.StockNumber
)
The description of what you need to do is too vague for me to be specific, but you can simply use a query with JOINs as your source. I like to put it in a CTE to keep it tidy, like so:
WITH cte AS (SELECT query with JOINS)
MERGE INTO items AS Target
using cte AS Source
ON
EDIT: To also do a JOIN on the Target (items) you need to do it in the ON conditions:
WITH cte AS (SELECT query with JOINS)
MERGE INTO items AS Target
using cte AS Source
ON Source.CompanyID=(
SELECT TOP 1 CompanyId
FROM TableWithCompanyId
JOIN Target
ON JoinCondition=true
)...
I know yours involves two tables to get from items to company, but the example above shows you the technique that I believe you are missing.
EDIT 2, based on latest attempt:
Try it this way:
MERGE INTO Items AS Target
USING @Source AS Source
ON Source.CompanyId = (
    SELECT TOP 1 Companies.Id
    FROM Branches
    INNER JOIN Companies
        ON Branches.CompanyId = Companies.Id
    WHERE Branches.Id = Target.BranchId
)
AND Target.StockNumber = Source.StockNumber
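For completeness, here is a sketch of how the statement might look with the matched branches filled in (the column lists are assumptions; the question never pins down the exact Items columns):

MERGE INTO Items AS Target
USING @Source AS Source
ON Source.CompanyId = (
    SELECT TOP 1 Companies.Id
    FROM Branches
    INNER JOIN Companies ON Branches.CompanyId = Companies.Id
    WHERE Branches.Id = Target.BranchId
)
AND Target.StockNumber = Source.StockNumber
WHEN MATCHED THEN
    -- move the item to the incoming branch
    UPDATE SET Target.BranchId = Source.BranchId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BranchId, StockNumber)
    VALUES (Source.BranchId, Source.StockNumber);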

SQL join conditional either or not both?

I have 3 tables that I'm joining and 2 variables that I'm using in one of the joins.
What I'm trying to do is figure out how to join based on either of the statements but not both.
Here's the current query:
SELECT DISTINCT
WR.Id,
CAL.Id as 'CalendarId',
T.[First Of Month],
T.[Last of Month],
WR.Supervisor,
WR.cd_Manager as [Manager], --Added to search by the Manager--
WR.[Shift] as 'ShiftId'
INTO #Workers
FROM #T T
--Calendar
RIGHT JOIN [dbo].[Calendar] CAL
ON CAL.StartDate <= T.[Last of Month]
AND CAL.EndDate >= T.[First of Month]
--Workers
--This is the problem join
RIGHT JOIN [dbo].[Worker_Filtered] WR
ON WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN (@Supervisors))
OR (WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN (@Supervisors))
AND WR.cd_Manager IN (SELECT Id FROM [dbo].[User] WHERE FullName IN (@Manager))) --Added to search by the Manager--
AND WR.[Type] = '333E7907-EB80-4021-8CDB-5380F0EC89FF' --internal
WHERE CAL.Id = WR.Calendar
AND WR.[Shift] IS NOT NULL
What I want is for the result to be based either on the Worker_Filtered table matching @Supervisor, or (but not both) on it matching both @Supervisor and @Manager.
The way it is now, if it matches either condition it is returned. It should limit the returned results to workers that have both the supervisor and the manager, which would be a smaller data set than matching the supervisor alone.
UPDATE
The query that I have above is part of a greater whole that pulls data for a supervisor's workers.
I want to also limit it to managers that are under a particular supervisor.
For example, if @Supervisor = 'John Doe' and @Manager = 'Jane Doe', and John has 9 workers, 8 of whom are under Jane's management, then I would expect the end result to show only 8 workers for each month. With the current query, it still shows all 9 for each month.
If I change part of the RIGHT JOIN to:
WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN (@Supervisors))
AND WR.cd_Manager IN (SELECT Id FROM [dbo].[User] WHERE FullName IN (@Manager))
Then it just returns 12 rows of NULL.
UPDATE 2
Sorry, this has taken so long to get a sample up. I could not get SQL Fiddle to work for SQL Server 2008/2014 so I am using rextester instead:
Sample
This shows the results as 108 lines. But what I want to show is just the first 96 lines.
UPDATE 3
I have made a slight update to the Sample. This does get the results that I want: I can set @Manager to NULL and it will pull all 108 records, or I can put the correct manager name in there and it will only pull those that match both Supervisor and Manager.
However, I'm doing this with an IF ELSE, which I was hoping to avoid, as it duplicates the code for the insert into the Worker table.
The description of expected results in update 3 makes it all clear now, thanks. Your 'problem' join needs to be:
RIGHT JOIN Worker_Filtered wr ON (wr.Supervisor IN (@Supervisors)
    AND CASE WHEN @Manager IS NULL THEN 1
             ELSE CASE WHEN wr.Manager IN (@Manager) THEN 1 ELSE 0 END
        END = 1)
By the way, I don't know what you are expecting in (@Supervisors) to achieve, but if you're hoping to supply a comma-separated list of supervisors as a single string and have wr.Supervisor match any one of them, you're going to be disappointed. This query works exactly the same as if you had = @Supervisors instead.
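If you do need to pass a comma-separated list in one string, a split-based predicate is one option on SQL Server 2016+ (a sketch only; the question targets 2008/2014, where you would need a user-defined split function instead):

RIGHT JOIN Worker_Filtered wr
    ON wr.Supervisor IN (SELECT LTRIM(value) FROM STRING_SPLIT(@Supervisors, ','))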

Apply OPENJSON to a single column

I have a products table with two attribute columns and a JSON column. I'd like to be able to split the JSON column out into extra rows while retaining the attributes. Sample data looks like:
ID Name Attributes
1 Nikon {"4e7a":["jpg","bmp","nef"],"604e":["en"]}
2 Canon {"4e7a":["jpg","bmp"],"604e":["en","jp","de"]}
3 Olympus {"902c":["yes"], "4e7a":["jpg","bmp"]}
I understand OPENJSON can convert JSON objects into rows, and key values into cells, but how do I apply it to a single column that contains JSON data?
My goal is to have an output like:
ID Name key value
1 Nikon 902c NULL
1 Nikon 4e7a ["jpg","bmp","nef"]
1 Nikon 604e ["en"]
2 Canon 902c NULL
2 Canon 4e7a ["jpg","bmp"]
2 Canon 604e ["en","jp","de"]
3 Olympus 902c ["yes"]
3 Olympus 4e7a ["jpg","bmp"]
3 Olympus 604e NULL
Is there a way I can query this products table like the attempt below? Or is there another way to reproduce my goal data set?
SELECT
ID,
Name,
OPENJSON(Attributes)
FROM products
Thanks!
Here is something that will at least start you in the right direction.
SELECT P.ID, P.[Name], AttsData.[key], AttsData.[Value]
FROM products P CROSS APPLY OPENJSON (P.Attributes) AS AttsData
The one thing that has me stuck a bit right now is the missing values (the rows where value is NULL in your desired result)...
I was thinking of maybe doing some sort of outer/full join back to this, but even that is giving me headaches. Are you certain you need those rows? Or could you do an existence check with the output from the SQL above?
I am going to keep at this. If I find a solution that matches your output exactly, I will add to this answer.
Until then... good luck!
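In the meantime, the existence check suggested above might look something like this (a hypothetical sketch checking a single key, not part of the original answer):

SELECT P.ID, P.[Name],
       CASE WHEN EXISTS (SELECT 1 FROM OPENJSON(P.Attributes) A WHERE A.[key] = '902c')
            THEN 'yes' ELSE 'no'
       END AS Has902c
FROM products P;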
You can get the rows with NULL value fields by creating a list of possible keys and using CROSS APPLY to associate each key to each row from the original dataset, and then left-joining in the parsed JSON.
Here's a working example you should be able to execute as-is:
-- Throw together a quick and dirty CTE containing your example data
WITH OriginalValues AS (
SELECT *
FROM (
VALUES ( 1, 'Nikon', '{"4e7a":["jpg","bmp","nef"],"604e":["en"]}' ),
( 2, 'Canon', '{"4e7a":["jpg","bmp"],"604e":["en","jp","de"]}' ),
( 3, 'Olympus', '{"902c":["yes"], "4e7a":["jpg","bmp"]}' )
) AS T ( ID, Name, Attributes )
),
-- Build a separate dataset that includes all possible 'key' values from the JSON.
PossibleKeys AS (
SELECT DISTINCT A.[key]
FROM OriginalValues CROSS APPLY OPENJSON( OriginalValues.Attributes ) AS A
),
-- Get the existing keys and values from the JSON, associated with the record ID
ValuesWithKeys AS (
SELECT OriginalValues.ID, Atts.[key], Atts.Value
FROM OriginalValues CROSS APPLY OPENJSON( OriginalValues.Attributes ) AS Atts
)
-- Join each possible 'key' value with every record in the original dataset, and
-- then left join the parsed JSON values for each ID and key
SELECT OriginalValues.ID, OriginalValues.Name, KeyList.[key], ValuesWithKeys.Value
FROM OriginalValues
CROSS APPLY PossibleKeys AS KeyList
LEFT JOIN ValuesWithKeys
ON OriginalValues.ID = ValuesWithKeys.ID
AND KeyList.[key] = ValuesWithKeys.[key]
ORDER BY ID, [key];
If you need to include some pre-determined key values where some of them might not exist in ANY of the JSON values stored in Attributes, you could construct a CTE (like I did to emulate your original dataset) or a temp table to provide those values instead of doing the DISTINCT selection in the PossibleKeys CTE above. If you already know what your possible key values are without having to query them out of the JSON, that would most likely be a less costly approach.
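For instance, swapping the DISTINCT scan for a predetermined key list would look like this (a sketch; the three key values are assumed from the sample data):

PossibleKeys AS (
    SELECT [key]
    FROM ( VALUES ('902c'), ('4e7a'), ('604e') ) AS K ( [key] )
),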

SQL Server FullText Search with Weighted Columns from Previous One Column

In the database on which I am attempting to create a FullText Search, I need to construct a table whose column names come from one column of a previous table. In my current implementation attempt, the FullText indexing is done on the first table, Data, and the search for the phrase happens there; then the second table, holding the search results, is built.
The schema for the database is
**Players**
Id
PlayerName
Blacklisted
...
**Details**
Id
Name -> FirstName, LastName, Team, Substitute, ...
...
**Data**
Id
DetailId
PlayerId
Content
DetailId in the table Data relates to Id in Details, and PlayerId relates to Id in Players. If there are 1k rows in Players and 20 rows in Details, then there are 20k rows in Data.
WITH RankedPlayers AS
(
SELECT PlayerID, SUM(KT.[RANK]) AS Rnk
FROM Data c
INNER JOIN FREETEXTTABLE(dbo.Data, Content, '"Some phrase like team name and player name"')
AS KT ON c.DataID = KT.[KEY]
GROUP BY c.PlayerID
)
…
Then a table is made by selecting the rows in one column. Similar to a pivot.
…
SELECT rc.Rnk,
c.PlayerID,
PlayerName,
TeamID,
…
(SELECT Content FROM dbo.Data data WHERE DetailID = 1 AND data.PlayerID = c.PlayerID) AS [TeamName],
…
FROM dbo.Players c
JOIN RankedPlayers rc ON c.PlayerID = rc.PlayerID
ORDER BY rc.Rnk DESC
I can return a ranked table with this implementation; the aim, however, is to produce results from weighted columns, so that, say, the column PlayerName contributes more to the rank than TeamName.
I have tried making a schema-bound view with a pivot, but then I cannot index it because of the pivot. I have tried making a view of that view, but it seems the metadata is inherited, and besides, that feels like a clunky method.
I then tried to do it as a straight query using subqueries in the select statement, but could not, because the indexing does not allow subqueries.
I then tried to join multiple times; again, the index on the view doesn't like self-referencing joins.
How can I do this?
I have come across this article http://developmentnow.com/2006/08/07/weighted-columns-in-sql-server-2005-full-text-search/ , and other articles here on weighted columns; however, nothing I can find addresses weighting columns when the columns were initially row data.
A simple solution that works really well: put a weight on the rows containing the required IDs in another table, left join that table to the table the full-text search is applied to, and multiply the rank by the weight. Then continue as previously implemented.
In code that comes out as:
DECLARE @Weight TABLE
(
    DetailID INT,
    [Weight] FLOAT
);
INSERT INTO @Weight VALUES
(1, 0.80),
(2, 0.80),
(3, 0.50);
WITH RankedPlayers AS
(
SELECT PlayerID, SUM(KT.[RANK] * ISNULL(cw.[Weight], 0.10)) AS Rnk
FROM Data c
INNER JOIN FREETEXTTABLE(dbo.Data, Content, 'Karl Kognition C404') AS KT ON c.DataID = KT.[KEY]
LEFT JOIN @Weight cw ON c.DetailID = cw.DetailID
GROUP BY c.PlayerID
)
SELECT rc.Rnk,
...
I'm using a table variable here as a proof of concept. I am considering adding a Weight column to the Details table instead, to avoid an unnecessary table and left join; a sketch of that variant follows.
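Assuming a Weight column is added to Details (the 0.10 default mirrors the ISNULL fallback above), the ranking CTE might become:

ALTER TABLE dbo.Details ADD [Weight] FLOAT NOT NULL DEFAULT 0.10;

WITH RankedPlayers AS
(
    SELECT c.PlayerID, SUM(KT.[RANK] * d.[Weight]) AS Rnk
    FROM Data c
    INNER JOIN FREETEXTTABLE(dbo.Data, Content, 'Karl Kognition C404') AS KT ON c.DataID = KT.[KEY]
    INNER JOIN dbo.Details d ON c.DetailID = d.Id
    GROUP BY c.PlayerID
)
SELECT rc.Rnk,
...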
