How to join query_id & METADATA$ROW_ID in SnowFlake - snowflake-cloud-data-platform

I am working on tracking changes in data along with a few audit details, such as the user who made the changes.
Streams in Snowflake give the delta records and a few audit columns, including METADATA$ROW_ID.
Another source, information_schema.query_history, contains query history details including query_id, user_name, DB name, schema name, etc.
I am looking for a way to join query_id and METADATA$ROW_ID so that I can find the user_name corresponding to each change in the data.
Any lead will be much appreciated.
Regards,
Neeraj

The METADATA$ROW_ID column in a stream uniquely identifies each row in the source table so that you can track its changes using the stream.
It isn't there to track who changed the data; rather, it is used to track how the data changed.
To my knowledge Snowflake doesn't track who changed individual rows. This is something you would have to build into your application yourself, for example by having a column like updated_by.

The only way I have found is to add
SELECT * FROM table(information_schema.QUERY_HISTORY_BY_SESSION()) ORDER BY start_time DESC LIMIT 1
during report/table/row generation.
Assuming you have not changed the setting that allows more than one query to run at the same time in one session, this returns the currently running query's ID. Turn it into a CTE and cross join it into the last part of the SELECT so the ID is written to every row (a sketch of this is shown below).
This way you can look up all the variables in the query_history table. Also remember that Snowflake keeps SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY (and other data) for up to one year, so I recommend a weekly/monthly job that merges the data into a long-term history table. That way you can also handle access to history data much more easily than by giving users the ACCOUNTADMIN role.
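For illustration, a minimal sketch of that idea, assuming a stream named my_stream on the source table and an audit table change_audit (both placeholder names, not from the question):

INSERT INTO change_audit (row_id, action, query_id, captured_at)
WITH current_query AS (
    SELECT query_id
    FROM TABLE(information_schema.QUERY_HISTORY_BY_SESSION())
    ORDER BY start_time DESC
    LIMIT 1
)
SELECT
    s.METADATA$ROW_ID,
    s.METADATA$ACTION,
    q.query_id,               -- join back to query_history later for user_name etc.
    CURRENT_TIMESTAMP()
FROM my_stream s
CROSS JOIN current_query q;

You can then join change_audit.query_id to query_history.query_id to get the user_name and the other session details.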

Related

How to use dbt's incremental model without duplicates

I have a query that selects some data that I would like to use to create an incremental table. Something like:
{{
  config(
    materialized='incremental',
    unique_key='customer_id'
  )
}}

SELECT
    customer_id,
    email,
    updated_at,
    first_name,
    last_name
FROM data
The input data has duplicate customers in it. If I read the documentation correctly, then records with the same unique_key should be seen as the same record. They should be updated instead of creating duplicates in the final table. However, I am seeing duplicates in the final table instead. What am I doing wrong?
I am using Snowflake as a data warehouse.
If your source table already contains the duplicates, this is the regular behavior.
As per the dbt documentation: "The first time a model is run, the table is built by transforming all rows of source data."
Docs: https://docs.getdbt.com/docs/build/incremental-models
This basically means that duplicates will be avoided in all future loads, but not during the initial creation. Hence you need to change your SELECT statement so that duplicates are filtered out in the creation itself.
With incremental materialization dbt does a merge or delete+insert using the unique_key. I believe in the case of Snowflake it's doing a merge. This means that running the same model several times won't write the same records over and over again to the target table.
If you are seeing duplicates, most likely your SELECT returns duplicate records. You'd need to deduplicate your input data, which is often done with the ROW_NUMBER() function, something like:
(ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY {{ timestamp_column }} DESC) = 1) AS is_latest
and then filtering with WHERE is_latest in an outer query (a fuller sketch follows below).
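Put together, a deduplicated version of the model might look like the following sketch (updated_at is assumed to be the ordering timestamp, and the flagged CTE name is a placeholder):

{{
  config(
    materialized='incremental',
    unique_key='customer_id'
  )
}}

WITH flagged AS (
    SELECT
        customer_id,
        email,
        updated_at,
        first_name,
        last_name,
        -- TRUE only for the most recent row per customer_id
        (ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1) AS is_latest
    FROM data
)
SELECT
    customer_id,
    email,
    updated_at,
    first_name,
    last_name
FROM flagged
WHERE is_latest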

SQL Server: Best way to automatically stream updates to a summary table based on changes from another table

I am currently working in a SQL server database where I have a table User that has a schema like so:
username  category
user1     gaming
user2     gaming
user3     sports
My summary table UserCategoryCount is a simple groupby statement for how many users belong to each category and looks like this:
category  numUsers
gaming    2
sports    1
New entries are constantly being uploaded to the User table, and I want to be able to stream updates in the User table to the UserCategoryCount summary table. I am aware that I can create a simple VIEW that performs a GROUP BY on the User table, but I would like UserCategoryCount to be its own table that automatically changes based on new users being uploaded to the User table.
My first thought was to create a trigger that detects when the User table has been updated. So far, the simplest but cheesiest solution I can think of is a trigger that simply deletes and rebuilds UserCategoryCount:
CREATE TRIGGER TRG_Add_User
ON User
AS
BEGIN
    DELETE FROM UserCategoryCount
    INSERT INTO UserCategoryCount (category, numUsers)
    SELECT Category, COUNT(Category) AS numUsers
    FROM User
    GROUP BY Category
END
GO
But this seems like a really hacky way of updating the UserCategoryCount table. Any help on how to improve this update statement so that I don't have to completely overwrite the table every time a new user or batch of users has been inserted would be greatly appreciated.
For a start, your trigger is seriously flawed: it does not use the inserted or deleted tables and instead recalculates the whole thing every time, which is going to be very bad for performance. It also does not specify whether it fires for inserts, updates or deletes.
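For illustration only, here is a minimal insert-only sketch that reads from the inserted pseudo-table and maintains the counts incrementally (it reuses the question's table names, but the MERGE-based approach is my own and does not handle updates or deletes):

CREATE TRIGGER TRG_Add_User_Incremental
ON dbo.[User]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Upsert only the categories touched by this insert.
    MERGE dbo.UserCategoryCount AS target
    USING (
        SELECT Category, COUNT(*) AS newUsers
        FROM inserted
        GROUP BY Category
    ) AS src
    ON target.category = src.Category
    WHEN MATCHED THEN
        UPDATE SET numUsers = target.numUsers + src.newUsers
    WHEN NOT MATCHED THEN
        INSERT (category, numUsers) VALUES (src.Category, src.newUsers);
END
GO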
A much better solution is to use an indexed view. This is like a regular view, except that the server maintains the actual data on disk, and updates it in real-time whenever there are changes to the underlying tables.
CREATE OR ALTER VIEW dbo.UserCategoryCount
WITH SCHEMABINDING
AS
SELECT
    u.Category,
    COUNT_BIG(*) AS numUsers
FROM dbo.[User] u
GROUP BY u.Category;
GO
CREATE UNIQUE CLUSTERED INDEX CX_UserCategoryCount ON dbo.UserCategoryCount (Category);
There are some restrictions on indexed views, among them:
They must be schema-bound, and therefore the underlying columns cannot be changed.
All table references must be two-part names (schema.table).
The only joins allowed are INNER or CROSS; no LEFT/RIGHT/FULL/APPLY, derived tables, CTEs or subqueries.
If there is a GROUP BY, you must include COUNT_BIG(*), and the only other aggregate allowed is SUM.
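Once the unique clustered index exists, reading the summary is just a query against the view; note that outside Enterprise edition the NOEXPAND hint is usually needed for the stored index data to be used directly:

SELECT Category, numUsers
FROM dbo.UserCategoryCount WITH (NOEXPAND);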

Retrieve login time of users among billion records

There is a table which keeps the login information of users:
UserID  LoginTime            MacAddress  IPAdress
1       2017-02-05 20:02:40  --          192.168.10.3
This table has billions of records. We need to get the last login time of each user with different filters, for example within the last 6 months. This table also has to be joined with the Users table to retrieve user information, and filters on the Users table may be requested as well, for example:
WHERE UserName = 'xxxx' AND Last_Login_Time within the last 6 months, and any other filters.
I know that there are approaches like ROW_NUMBER and a query like this:
SELECT MAX(LoginTime) AS [Last Login Time], UserID
FROM UsersLoginHistory
GROUP BY UserID;
But these approaches take a long time.
Can anyone suggest a better query (preferably one that supports OFFSET for paging) for this problem?
With the current data model you will need to read through the whole table anyway to retrieve information about all users and their last logins. To make this kind of report fast, you should pre-calculate it.
I can suggest one of the following approaches (a sketch of #2 follows this list):
Store the last login time in the UsersLogin table. Your back-end should update this table in the same transaction as the insert into UsersLoginHistory.
Create an index on (UserID, LoginTime).
You can replicate the logic of #1 somewhere in the database (using an AFTER INSERT trigger, for example), but I do not recommend doing this, because business logic will eventually bloat your database.
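A minimal sketch of approach #2 in T-SQL, assuming the question's UsersLoginHistory table and a Users table with a UserName column (the index name and page size are placeholders):

-- Index supporting "last login per user" lookups.
CREATE NONCLUSTERED INDEX IX_UsersLoginHistory_User_Time
    ON UsersLoginHistory (UserID, LoginTime DESC);

-- Last login per user within the last 6 months, joined to Users, paged with OFFSET.
SELECT u.UserID, u.UserName, h.LastLoginTime
FROM Users u
JOIN (
    SELECT UserID, MAX(LoginTime) AS LastLoginTime
    FROM UsersLoginHistory
    WHERE LoginTime >= DATEADD(MONTH, -6, GETDATE())
    GROUP BY UserID
) h ON h.UserID = u.UserID
ORDER BY h.LastLoginTime DESC
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;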

How to find out what is acting on an Oracle database table?

There is this table in my Oracle database that is used to store audit information.
When I first did a SELECT * on that table, the audit timestamps were all on the same day, within the same hour (e.g. 18/10/2013 15:06:45, 18/10/2013 15:07:29); the next time I did it, the previous entries were gone, and the table then only contained entries with the 16:mm:ss timestamp.
I think something is acting on that table, such that at some interval the table contents are (or may be) backed up somewhere - I don't know where - and the table is then cleared. However, as I'm not familiar with databases, I'm not sure what is doing this.
I'd like to know how I can find out what is acting on this table, so that I can in turn retrieve the previous data I need.
EDIT:
What I've tried thus far...
SELECT * FROM DBA_DEPENDENCIES WHERE REFERENCED_NAME='MY_AUDIT_TABLE';
I got back four results, but all of them (as far as I can tell) are about putting data into the table, none about backing it up anywhere.
SELECT * FROM MY_AUDIT_TABLE AS OF TIMESTAMP ...
This only gives me a snapshot at a certain time, but since the table is being updated very frequently, it does not make sense for me to query every second.
The dba_dependencies view will give you an idea of which procedures, functions, etc. act on the table:
SELECT * FROM DBA_DEPENDENCIES WHERE REFERENCED_NAME='MY_AUDIT_TABLE';
where MY_AUDIT_TABLE is the audit table name.
If the table is accessed through a synonym, then:
SELECT * FROM DBA_DEPENDENCIES WHERE REFERENCED_NAME='MY_AUDIT_TABLE_SYNONYM';
where MY_AUDIT_TABLE_SYNONYM is the synonym for MY_AUDIT_TABLE.
Or, to see whether any triggers are acting on the table:
SELECT * FROM DBA_TRIGGERS WHERE TABLE_NAME='MY_AUDIT_TABLE';
For an external script processing the table, you can ask the DBA to turn on fine-grained auditing (FGA) for the table.
Then query the view DBA_FGA_AUDIT_TRAIL with a timestamp between 15:00:00 and 16:00:00 to check the external call (the OS_PROCESS column gives the operating system process ID) or what SQL (SQL_TEXT) is executed against the table.
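A minimal sketch of that suggestion (the schema and policy names are placeholders, and it assumes you have the privileges to call DBMS_FGA):

BEGIN
  DBMS_FGA.ADD_POLICY(
    object_schema   => 'APP_SCHEMA',          -- placeholder schema
    object_name     => 'MY_AUDIT_TABLE',
    policy_name     => 'FGA_MY_AUDIT_TABLE',  -- placeholder policy name
    statement_types => 'SELECT,INSERT,UPDATE,DELETE'
  );
END;
/

-- Who touched the table during the window in question.
SELECT db_user, os_process, sql_text, timestamp
FROM dba_fga_audit_trail
WHERE object_name = 'MY_AUDIT_TABLE'
  AND timestamp BETWEEN TO_TIMESTAMP('2013-10-18 15:00:00', 'YYYY-MM-DD HH24:MI:SS')
                    AND TO_TIMESTAMP('2013-10-18 16:00:00', 'YYYY-MM-DD HH24:MI:SS');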

data synchronization from unreliable data source to SQL table

I am looking for a pattern, framework, or best practice to handle the generic problem of application-level data synchronisation.
Let's take an example with only one table to make it easier.
I have an unreliable data source for a product catalog. Data can occasionally be unavailable, incomplete, or inconsistent (the issue might come from manual data-entry errors, ETL failures, ...).
I have a live copy in a MySQL table, in use by a live system, let's say a website.
I need to implement safety mechanisms when updating the MySQL table to "synchronize" with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the data source => use a "deleted" boolean/date column or an archive/history table.
check for inconsistent changes => configure rules per column, such as: should never change, should only increment.
check for integrity issues => (standard problem, no point discussing the approach)
ability to roll back the last sync => restore from a history table? use a version increment/date column?
What I am looking for is a best practice and a pattern/tool to handle such a problem. Even if you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which field of expertise to explore.
We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64).
Every table has a history table that is identical to the original, but has an additional column "superseded_id" which gets the import id of the import that changed the row (deletion is a change), and the primary key is (row_id, superseded_id).
Every UPDATE copies the row to the history table before changing it.
Every DELETE moves the row to the history table.
This makes rollback very easy:
Find out the import_id of the bad import.
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history_table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database into a new table main_table_$id, with $id being the highest import id, and have main_table be a trivial view doing SELECT * FROM main_table_$someid. Now, by redefining the view to SELECT * FROM main_table_$newid, we can atomically switch the table.
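A minimal MySQL sketch of that layout, assuming main_table has a primary key column named row_id (a placeholder) and that the application archives rows before changing them:

-- History table: same columns plus superseded_id, keyed by (row_id, superseded_id).
CREATE TABLE history_table LIKE main_table;
ALTER TABLE history_table
    ADD COLUMN superseded_id BIGINT UNSIGNED NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (row_id, superseded_id);

-- Before import @import_id updates or deletes a row, copy it to the history table.
INSERT INTO history_table
SELECT m.*, @import_id
FROM main_table m
WHERE m.row_id = 42;   -- placeholder row id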
I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying the price would be "update products set price = 1 where product_id = ...".
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product means running update products set valid_until = getdate() where product_id = xxx and valid_until is null, and then inserting a new row with deleted_flag = true.
Changing the price works the same way (see the sketch below).
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of a record over time, and to roll back easily.
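As a sketch in SQL Server-flavoured SQL (matching the getdate() above, with placeholder values), a price change in the versioned model looks like this:

-- Change the price of product 42: close out the current row, then insert the new version.
UPDATE products
SET valid_until = getdate()
WHERE product_id = 42 AND valid_until IS NULL;

INSERT INTO products (product_id, valid_from, valid_until, deleted_flag, price, description, availableflag)
VALUES (42, getdate(), NULL, 0, 9.99, 'Some product', 1);

-- The current, non-deleted catalog is then:
SELECT * FROM products WHERE valid_until IS NULL AND deleted_flag = 0;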
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having an amount_in_stock column on your products table, have a "product_stock_transactions" table:
product_stock_transactions
product_id (FK)  transaction_date  transaction_quantity  transaction_source
1                1 Jan 2012        100                   product_feed
1                2 Jan 2012        -3                    stock_adjust_feed
1                3 Jan 2012        10                    product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
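A quick sketch of how the current level falls out of this design (using the table and columns above):

-- Current quantity in stock per product is just the sum of its transactions.
SELECT product_id, SUM(transaction_quantity) AS amount_in_stock
FROM product_stock_transactions
GROUP BY product_id;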
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels and deleting (or archiving) old records.
