dbt incremental model in SQL Server

From this model, and according to my understanding of the documentation of dbt:
EDIT: (removed the DISTINCT in the statement as it was unnecessary)
Test.sql
{{ config(
    as_columnstore=false,
    schema='staging',
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    merge_update_columns=['name', 'updated_at']
) }}
SELECT I.id,
       I.name,
       MAX(I.extraction_date) AS created_at,
       MAX(I.extraction_date) AS updated_at
FROM staging.test_data_raw I
WHERE I.id IS NOT NULL
GROUP BY I.id, I.name
I expected that, if a record with a matching id already existed in the table, the name and updated_at would change but created_at would remain as it was.
But, after running it several times, the created_at always changes. So my guess is that dbt is not performing a merge operation but a delete/insert.
I am running dbt with the SQL Server connector.
Is it possible that this connector does not implement the merge strategy?
Or am I doing something wrong here? And if so, is there any way to solve this?

I am sitting here and wondering about the same thing (but with Azure DW). No positive response in 15 days, so I guess the answer must be that the incremental strategy "merge" has not yet been implemented in the dbt-sqlserver adapter. Go to https://github.com/dbt-msft/dbt-sqlserver/discussions/categories/ideas and suggest this feature!
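To make the symptom concrete, here is a minimal sketch of why created_at behaves differently under the two strategies. It runs against an in-memory SQLite database rather than SQL Server or dbt, and SQLite's INSERT ... ON CONFLICT DO UPDATE stands in for a MERGE that only updates name and updated_at. The table name test and the sample rows are made up for illustration, and SQLite 3.24 or newer is assumed.

```python
import sqlite3


def run_twice(apply_batch):
    """Load the same id twice with different extraction dates; return the final row."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, "
        "created_at TEXT, updated_at TEXT)"
    )
    apply_batch(con, (1, "alice", "2023-01-01"))
    apply_batch(con, (1, "alice b", "2023-02-01"))
    return con.execute(
        "SELECT id, name, created_at, updated_at FROM test"
    ).fetchone()


def delete_insert(con, row):
    # delete+insert: the old row is dropped, so created_at is overwritten
    id_, name, extracted = row
    con.execute("DELETE FROM test WHERE id = ?", (id_,))
    con.execute(
        "INSERT INTO test VALUES (?, ?, ?, ?)", (id_, name, extracted, extracted)
    )


def merge_like(con, row):
    # merge limited to name/updated_at: created_at of an existing row is kept
    id_, name, extracted = row
    con.execute(
        """
        INSERT INTO test (id, name, created_at, updated_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (id) DO UPDATE
        SET name = excluded.name, updated_at = excluded.updated_at
        """,
        (id_, name, extracted, extracted),
    )


delete_insert_result = run_twice(delete_insert)
merge_result = run_twice(merge_like)
print(delete_insert_result)
print(merge_result)
```

If created_at moves on every run, as described in the question, the behaviour matches the delete_insert path rather than the merge_like one.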

Related

Why does the polybase pushdown filter for join not work?

I recently put PolyBase into production and have noticed that pushdown isn't working for joins. Take the following query:
CREATE TABLE #VendIDs
    (VendorAN int PRIMARY KEY);
INSERT INTO #VendIDs (VendorAN)
VALUES (1),(89980),(89993),(90002),(90003),(90008),(90004),(90015),(90018),(97140),(97139),(97138)
      ,(97137),(97136),(97135),(97134),(97059),(97058),(97057),(97056),(97055),(97054),(97053),(97052);
SELECT VW.VendorAN, [Type], Code, [Name], Address1, Address2, City, State, Zip, Country, Phone,
       ShipAddress1, ShipAddress2, ShipCity, ShipState, ShipZip, Latitude, Longitude
FROM vwVendor VW
JOIN #VendIDs FV ON VW.VendorAN = FV.VendorAN;
The execution plan shows 22k rows from the 'remote query', which just happens to match the number of rows in the external table. After enabling trace flag 6408, it shows 22k records on the external side.
If I use a simple WHERE VendorAN = XXXXXX instead, I can clearly see 1 row being returned via the remote query, with the filtering done on the remote side.
Anyone have a thought on why I'm not seeing pushdown filtering on the join as shown above? Makes no sense to me based upon what I've read to date.
Referencing this article for how to know if pushdown filtering is occurring: https://learn.microsoft.com/en-us/sql/relational-databases/polybase/polybase-how-to-tell-pushdown-computation?view=sql-server-ver15#--use-tf6408
Is your external data source created with PUSHDOWN = ON, or does your query use OPTION (FORCE EXTERNALPUSHDOWN)?
Have you created statistics from the external table?
Is the remote table partitioned?
How is vwVendor created? Is it a view on the external table joined to other tables?
You also need to take a look at sys.dm_exec_distributed_request_steps and sys.dm_exec_distributed_sql_requests to see what is occurring under the hood.

selecting all ids matching highest version of files from another table

I'm writing a simple backup program using sqlite with this layout:
Each file is identified by a unique hash and has multiple associated records in file_version. When a snapshot of the database is created, the most current file_versions are associated with it via snapshot_file.
Example:
file (hash,path)
abc|/img.png
bcd|/img.jpeg
file_version (id,mtime,md5,hash)
1|1000|md5aoeu|abc
2|1500|md5bcda|abc
3|2500|md5asdf|abc
4|2500|md5aoaa|bcd
snapshot (time, description)
1250| 'first snapshot'
2000| 'second snapshot'
3000| 'third snapshot'
When I create a new snapshot, I need to query the newest file_version for each file and add the appropriate records to snapshot_file.
So if I were to create a new snapshot, I would need the id of the newest file version of the file with hash 'abc' (matching file /img.png).
The expected result of the select (id, mtime, hash) is:
3|2500|abc
4|2500|bcd
Sorry, my english is pretty bad (title might be confusing), if you need further clarification, please lemme know.
Thanks in advance.
This is similar to:
How can I select all entries with the highest version?
however it's slightly more complicated than that (since there can be only one id per each file).
I would try something like this:
SELECT i.*
FROM file_version i
INNER JOIN (
    SELECT
        hash,
        MAX(mtime) AS latestTime
    FROM file_version
    GROUP BY hash
) latest ON i.mtime = latest.latestTime
        AND i.hash = latest.hash
EDIT
Based on the OP's comment, I would change the code to use a CTE
WITH latest_CTE (hash, latestTime) AS (
    SELECT
        hash,
        MAX(mtime) AS latestTime
    FROM file_version
    GROUP BY hash
)
SELECT i.* FROM file_version i
JOIN latest_CTE c ON i.mtime = c.latestTime
                 AND i.hash = c.hash
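A Common Table Expression mainly makes the query easier to read; it is not guaranteed to run faster than the equivalent subquery. Either way, please ensure that you have the right indexes on your table(s), for example one covering hash and mtime on file_version.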
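For anyone who wants to try the query against the sample data from the question, here is a small sketch using Python's built-in sqlite3 module. The column types and the variable names are my own assumptions; the rows are the ones listed in the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE file (hash TEXT PRIMARY KEY, path TEXT);
CREATE TABLE file_version (
    id INTEGER PRIMARY KEY,
    mtime INTEGER,
    md5 TEXT,
    hash TEXT REFERENCES file(hash)
);
INSERT INTO file VALUES ('abc', '/img.png'), ('bcd', '/img.jpeg');
INSERT INTO file_version VALUES
    (1, 1000, 'md5aoeu', 'abc'),
    (2, 1500, 'md5bcda', 'abc'),
    (3, 2500, 'md5asdf', 'abc'),
    (4, 2500, 'md5aoaa', 'bcd');
""")

# Newest version per file: find MAX(mtime) per hash, then join back to get the id.
latest = con.execute("""
    WITH latest_cte (hash, latest_time) AS (
        SELECT hash, MAX(mtime) FROM file_version GROUP BY hash
    )
    SELECT v.id, v.mtime, v.hash
    FROM file_version v
    JOIN latest_cte c ON v.hash = c.hash AND v.mtime = c.latest_time
    ORDER BY v.id
""").fetchall()
print(latest)
```

One caveat: if two versions of the same file share the same mtime, the join returns both of them. Since the question wants exactly one id per file, a tie-break such as taking MAX(id) per hash would be needed in that case.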

MSSql Batch Update of table on Primary Keys

I have migrated an Access DB to SQL Server 2008 and found some anomalies carried over from the old database. In both DBs the IDs are auto-incremented and should follow the order of the dates. But as shown below, some rows were saved in the wrong chronological order.
**Access:**
ID FileID DateOfTransaction SectionID
64490 95900 02/12/1997 100
64491 95900 04/04/1996 80
64492 95900 25/03/1996 90
**Desired Correct Format:**
ID FileID DateOfTransaction SectionID
64492 95900 02/12/1997 100
64491 95900 04/04/1996 80
64490 95900 25/03/1996 90
The PK (ID) table is linked to several other tables with update Cascade set.
I need to group by FileID and sort by DateOfTransaction and update IDs accordingly.
I need some suggestions on how best to tackle this as data is quite sensitive. I have about 50K records to update.
Thanks for reading!
try this query
with cte as
(select * from (
    select *, ROW_NUMBER() over (partition by FileID
                                 order by DateOfTransaction) as row_num
    from t_Transactions) A
 join
   (select ID B_ID, FileID B_FileID,
           ROW_NUMBER() over (partition by FileID order by ID) as B_row_num
    from t_Transactions) B
 on A.row_num = B.B_row_num
 and A.FileID = B.B_FileID) -- pair rows within the same FileID only
select T.ID [Old_ID], CTE.B_ID [New_ID],
       T.FileID, T.DateOfTransaction, T.SectionID
--update T set T.ID=CTE.B_ID
from t_Transactions T join cte
  on T.ID = CTE.ID
 and CTE.B_FileID = T.FileID
Before updating, you can run it as a select and confirm the result.
This query updates the table as per your requirement. You have mentioned that ID column is linked to several other tables. Please be very careful about this and make sure that updating ID column doesn't break anything else
SQL Fiddle Demo
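The pairing idea in the query above (rank the rows of each FileID once by date and once by ID, then match equal ranks) can be checked on the three sample rows with a small sketch. It uses Python's sqlite3 instead of SQL Server and assumes SQLite 3.25 or newer for window functions. Dates are stored as ISO text so they sort chronologically, and the CTE names by_date and by_id are my own.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t_Transactions (
    ID INTEGER PRIMARY KEY,
    FileID INTEGER,
    DateOfTransaction TEXT,  -- ISO yyyy-mm-dd so text order equals date order
    SectionID INTEGER
);
INSERT INTO t_Transactions VALUES
    (64490, 95900, '1997-12-02', 100),
    (64491, 95900, '1996-04-04', 80),
    (64492, 95900, '1996-03-25', 90);
""")

# Rank each FileID's rows by date and by ID, then pair equal ranks:
# the row with the n-th oldest date receives the n-th smallest ID.
mapping = con.execute("""
    WITH by_date AS (
        SELECT ID, FileID,
               ROW_NUMBER() OVER (PARTITION BY FileID
                                  ORDER BY DateOfTransaction, ID) AS rn
        FROM t_Transactions
    ),
    by_id AS (
        SELECT ID, FileID,
               ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY ID) AS rn
        FROM t_Transactions
    )
    SELECT d.ID AS old_id, i.ID AS new_id
    FROM by_date d
    JOIN by_id i ON i.FileID = d.FileID AND i.rn = d.rn
    ORDER BY d.ID
""").fetchall()
print(mapping)
```

This only produces the old-to-new mapping; it does not touch the table. If ID is still an IDENTITY column after the migration, SQL Server will reject a direct UPDATE of it, so the mapping would have to be applied by another route.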
Designing a database to rely on the order of an artificially-generated key to match the date order of another column is a terrible anti-pattern, NOT best practice in the slightest.
Stop relying on it to represent insertion order. That is the answer. If you need that data, it should be another column separate from your PK. Can't you order by date, anyway? If not, create a new column.
It is always a mistake to invest internal database identifiers with meaning of any kind besides relating rows to each other.
I've seen this exact problem before at a former employer--and the database was rife with all sorts of other design problems as well. FK columns were actually named "frnkeyColumnName" to match the "keyColumnName" they pointed to. Never mind a PK that was also an FK...
Stop the madness!
I would seriously consider whether you need to do this at all. Is there any logic that depends on higher IDs having a later date? Was the data already out of order in the Access database? If so, it doesn't matter.
If you do decide to proceed, back up the data first. You're probably going to make mistakes.

When to use with clause in sql

Can anybody tell me when to use the WITH clause?
The WITH keyword is used to create a temporary named result set. These are called Common Table Expressions.
A very basic, self-explanatory example:
WITH Administrators (Name, Surname)
AS
(
SELECT Name, Surname FROM Users WHERE AccessRights = 'Admin'
)
SELECT * FROM Administrators
For further reading and more examples, I suggest starting out with the following MSDN article:
Common Table Expressions by John Papa
In SQL Server you sometimes need the WITH clause to force a query to use an Index. This is often a necessity in spatial queries that can reduce query time from 1 minute to a few seconds.
select * from MyTable with(index(MySpatialIndex)) where...

joining latest of various usermetadata tags to user rows

I have a postgres database with a user table (userid, firstname, lastname) and a usermetadata table (userid, code, content, created datetime). I store various information about each user in the usermetadata table by code and keep a full history. so for example, a user (userid 15) has the following metadata:
15, 'QHS', '20', '2008-08-24 13:36:33.465567-04'
15, 'QHE', '8', '2008-08-24 12:07:08.660519-04'
15, 'QHS', '21', '2008-08-24 09:44:44.39354-04'
15, 'QHE', '10', '2008-08-24 08:47:57.672058-04'
I need to fetch a list of all my users and the most recent value of each of various usermetadata codes. I did this programmatically and it was, of course godawful slow. The best I could figure out to do it in SQL was to join sub-selects, which were also slow and I had to do one for each code.
This is actually not that hard to do in PostgreSQL because it has the "DISTINCT ON" clause in its SELECT syntax (DISTINCT ON isn't standard SQL).
SELECT DISTINCT ON (code) code, content, createtime
FROM metatable
WHERE userid = 15
ORDER BY code, createtime DESC;
That will limit the returned results to the first result per unique code, and if you sort the results by the create time descending, you'll get the newest of each.
I suppose you're not willing to modify your schema, so I'm afraid my answer might not be of much help, but here goes...
One possible solution would be to have the time field empty until it was replaced by a newer value, when you insert the 'deprecation date' instead. Another way is to expand the table with an 'active' column, but that would introduce some redundancy.
The classic solution would be to have both 'Valid-From' and 'Valid-To' fields where the 'Valid-To' fields are blank until some other entry becomes valid. This can be handled easily by using triggers or similar. Using constraints to make sure there is only one item of each type that is valid will ensure data integrity.
Common to these is that there is a single way of determining the set of current fields. You'd simply select all entries with the active user and a NULL 'Valid-To' or 'deprecation date' or a true 'active'.
You might be interested in taking a look at the Wikipedia entry on temporal databases and the article A consensus glossary of temporal database concepts.
A subselect is the standard way of doing this sort of thing. You just need a unique constraint on userid, code, and created, and then you can run the following (15 is the example user from the question):
SELECT m.*
FROM usermetadata m
JOIN (
    SELECT userid, code, MAX(created) AS last_created
    FROM usermetadata
    GROUP BY userid, code
) AS latest ON
    m.userid = latest.userid
    AND m.code = latest.code
    AND m.created = latest.last_created
WHERE
    m.userid = 15
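Here is a runnable sketch of the subselect approach on the four sample rows from the question. It uses Python's sqlite3 rather than PostgreSQL, so the timestamps are stored as text; that only works here because they all share the same format and offset. The WHERE filter is left out so the query returns the latest value of every code for every user, which is what the question asks for.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE usermetadata (
    userid INTEGER,
    code TEXT,
    content TEXT,
    created TEXT,
    UNIQUE (userid, code, created)
);
INSERT INTO usermetadata VALUES
    (15, 'QHS', '20', '2008-08-24 13:36:33.465567-04'),
    (15, 'QHE', '8',  '2008-08-24 12:07:08.660519-04'),
    (15, 'QHS', '21', '2008-08-24 09:44:44.39354-04'),
    (15, 'QHE', '10', '2008-08-24 08:47:57.672058-04');
""")

# Latest row per (userid, code): find MAX(created) per group, then join back.
latest_rows = con.execute("""
    SELECT m.userid, m.code, m.content
    FROM usermetadata m
    JOIN (
        SELECT userid, code, MAX(created) AS last_created
        FROM usermetadata
        GROUP BY userid, code
    ) AS latest
      ON m.userid = latest.userid
     AND m.code = latest.code
     AND m.created = latest.last_created
    ORDER BY m.userid, m.code
""").fetchall()
print(latest_rows)
```

Because the grouping happens once for all codes, there is no need for a separate sub-select per code.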

Resources