Snowflake Data Masking - snowflake-cloud-data-platform

We're looking to mask certain PII in our Snowflake environment where it relates to team members, and at the moment our masking is set up to mask every row in the column we define in our masking policies.
What we'd like to get to, though, is masking only the rows whose membership number appears in a separate table. Is that possible to implement, and how would I go about doing it?
+--------+--------+
| member | name   |
|--------+--------|
| A      | acds   |
| B      | asdas  |
| C      | asdeqw |
+--------+--------+

+--------+
| member |
|--------|
| B      |
+--------+
Just as an example, in the above tables, we'd only want to mask member B. At the moment, all 3 rows in the first table would be masked.
We have a possible workaround that does this in the logic of an extra view, but that actually alters the data, whereas our hope was to use Dynamic Data Masking and then have exception processes for it.

Prepare the data:
create or replace table member (member_id varchar, name varchar);
insert into member values ('A', 'member_a'),('B', 'member_b'),('C', 'member_c');
create or replace table member_to_be_masked(member_id varchar);
insert into member_to_be_masked values ('B');
If you want to mask the member_id column:
create or replace masking policy member_mask as (val string) returns string ->
  case
    when exists
    (
      select member_id
      from member_to_be_masked
      where member_id = val
    )
    then '********'
    else val
  end;
alter table if exists member
modify column member_id
set masking policy member_mask;
select * from member;
+-----------+----------+
| MEMBER_ID | NAME     |
|-----------+----------|
| A         | member_a |
| ********  | member_b |
| C         | member_c |
+-----------+----------+
However, if you want to mask the name column, I don't see an easy way. I tried having the policy link back to the table itself to find the member_id for the current row's name value, but it fails with the below error message:
Policy body contains a UDF or Select statement that refers to a Table attached to another Policy.
It looks like the policy can't reference back to the source table. And because the policy only receives the value of the defined column, it has no knowledge of the row's other column values, so we can't use them to decide whether to apply the mask.
It can work if you also store the name in the mapping table, together with the member_id, like below:
create or replace table member (member_id varchar, name varchar);
insert into member values ('A', 'member_a'),('B', 'member_b'),('C', 'member_c');
create or replace table member_to_be_masked(member_id varchar, name varchar);
insert into member_to_be_masked values ('B', 'member_b');
create or replace masking policy member_mask as (val string) returns string ->
  case
    when exists
    (
      select member_id
      from member_to_be_masked
      where name = val
    )
    then '********'
    else val
  end;
alter table if exists member
modify column name
set masking policy member_mask;
select * from member;
+-----------+----------+
| MEMBER_ID | NAME     |
|-----------+----------|
| A         | member_a |
| B         | ******** |
| C         | member_c |
+-----------+----------+
The downside of this approach is that if different members share the same name, all members with that name will be masked, regardless of whether each member's id is in the mapping table.

Keeping my previous answer in case it can still be useful.
Another workaround I can think of is to use variant data and then create a view on top of it.
prepare the data in JSON format:
create or replace table member_json (member_id varchar, data variant);
insert into member_json
select 'A', parse_json('{"member_id": "A", "name" : "member_a"}')
union
select 'B', parse_json('{"member_id": "B", "name" : "member_b"}')
union
select 'C', parse_json('{"member_id": "C", "name" : "member_c"}');
create or replace table member_to_be_masked(member_id varchar);
insert into member_to_be_masked values ('B');
The data looks like this:
select * from member_json;
+-----------+----------------------+
| MEMBER_ID | DATA                 |
|-----------+----------------------|
| A         | {                    |
|           |   "member_id": "A",  |
|           |   "name": "member_a" |
|           | }                    |
| B         | {                    |
|           |   "member_id": "B",  |
|           |   "name": "member_b" |
|           | }                    |
| C         | {                    |
|           |   "member_id": "C",  |
|           |   "name": "member_c" |
|           | }                    |
+-----------+----------------------+
select * from member_to_be_masked;
+-----------+
| MEMBER_ID |
|-----------|
| B         |
+-----------+
create a JS UDF:
create or replace function json_mask(mask boolean, v variant)
returns variant
language javascript
as
$$
  if (MASK) {
    V["member_id"] = '******';
    V["name"] = '******';
  }
  return V;
$$;
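You can sanity-check the UDF on its own before wiring it into a policy (a hypothetical literal input):
select json_mask(true, parse_json('{"member_id": "B", "name" : "member_b"}'));
-- returns { "member_id": "******", "name": "******" }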
create a masking policy using the UDF:
create or replace masking policy member_mask
as (val variant)
returns variant ->
  case
    when exists
    (
      select member_id
      from member_to_be_masked
      where member_id = val['member_id']
    )
    then json_mask(true, val)
    else val
  end;
apply the policy to the member_json table:
alter table if exists member_json
modify column data
set masking policy member_mask;
querying the table now shows member B masked:
select * from member_json;
+-----------+--------------------------+
| MEMBER_ID | DATA                     |
|-----------+--------------------------|
| A         | {                        |
|           |   "member_id": "A",      |
|           |   "name": "member_a"     |
|           | }                        |
| B         | {                        |
|           |   "member_id": "******", |
|           |   "name": "******"       |
|           | }                        |
| C         | {                        |
|           |   "member_id": "C",      |
|           |   "name": "member_c"     |
|           | }                        |
+-----------+--------------------------+
create a view on top of it:
create or replace view member_view
as
select
  data:"member_id" as member_id,
  data:"name" as name
from member_json;
querying the view shows the masked data as well:
select * from member_view;
+-----------+------------+
| MEMBER_ID | NAME       |
|-----------+------------|
| "A"       | "member_a" |
| "******"  | "******"   |
| "C"       | "member_c" |
+-----------+------------+
Not sure if this can help in your use case.

As I understand it, you want to mask one column in your table based on another column plus a lookup table.
We can use conditional masking in this case - https://docs.snowflake.com/en/sql-reference/sql/create-masking-policy.html#conditional-masking-policy
create or replace masking policy name_mask as (val string, id string) returns string ->
  case
    when exists
    (
      select 1
      from member_to_be_masked m
      where m.member_id = id  -- "id" is the second policy argument (the row's member_id); an argument named member_id would be shadowed by the column here
    )
    then '********'
    else val
  end;
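To attach the conditional policy, pass both the masked column and the condition column via USING (shown here against the member table from the earlier answers):
alter table if exists member
modify column name
set masking policy name_mask using (name, member_id);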
In the query profile, this will appear as a secure function. Please evaluate the performance: depending on the total number of records the function has to be applied to, the performance difference may be significant.

Related

Create SQL Server Select/Delete Query from value in other table

I have a master table named Master_Table and the columns and values in the master table are below:
| ID | Database   | Schema | Table_name  | Common_col | Value_ID |
+----+------------+--------+-------------+------------+----------+
| 1  | Database_1 | Test1  | Test_Table1 | Test_ID    | 1        |
| 2  | Database_2 | Test2  | Test_Table2 | Test_ID    | 1        |
| 3  | Database_3 | Test3  | Test_Table3 | Test_ID2   | 2        |
I have another Value_Table which consist of values that need to be deleted.
| Value_ID | Common_col | Value |
+----------+------------+-------+
| 1        | Test_ID    | 110   |
| 1        | Test_ID    | 111   |
| 1        | Test_ID    | 115   |
| 2        | Test_ID2   | 999   |
I need to build a query that generates DELETE statements for the tables listed in Master_Table, whose database and schema information is given in the same row. The column to filter on when deleting is given in the Common_col column of the master table, and the value to delete is in the Value column of Value_Table.
The result of my query should create a query as given below :
DELETE FROM Database_1.Test1.Test_Table1 WHERE Test_ID=110;
or
DELETE FROM Database_1.Test1.Test_Table1 WHERE Test_ID in (110,111,115);
These queries should run inside a loop so that I can delete all the rows from all the databases and tables provided in the master table.
Queries don't really create queries.
One way to do what you're saying, which could be useful if this is a one time thing or very occasional thing, is to use SSMS to generate query statements, then copy them to the clipboard, paste them into the window, and execute there.
SELECT 'DELETE FROM ' + mt.[Database] + '.' + mt.[Schema] + '.' + mt.Table_name
    + ' WHERE ' + mt.Common_col
    + ' = ' + CONVERT(VARCHAR(10), vt.Value)
FROM Master_Table mt
JOIN Value_Table vt ON vt.Value_ID = mt.Value_ID
This probably isn't what you want; it sounds more like you want to automate cleanup or something.
You can turn this into one big query if you don't mind repeating yourself a little:
DELETE T1
FROM Database_1.Test1.Test_Table1 T1
INNER JOIN Database_1.Test1.ValueTable VT ON
(VT.common_col = 'Test_ID' and T1.Test_ID=VT.Value) OR
(VT.common_col = 'Test_ID2' and T1.Test_ID2=VT.Value)
You can also use dynamic SQL combined with the first part ... but I hate dynamic SQL so I'm not going to put it in my answer.
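That said, a minimal sketch of that dynamic-SQL loop (assuming the Master_Table and Value_Table schemas above, with QUOTENAME guarding the identifiers) might look like this:
DECLARE @sql NVARCHAR(MAX);
DECLARE stmt_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT 'DELETE FROM ' + QUOTENAME(mt.[Database]) + '.' + QUOTENAME(mt.[Schema]) + '.' + QUOTENAME(mt.Table_name)
         + ' WHERE ' + QUOTENAME(mt.Common_col) + ' = ' + CONVERT(VARCHAR(10), vt.Value)
    FROM Master_Table mt
    JOIN Value_Table vt ON vt.Value_ID = mt.Value_ID;
OPEN stmt_cursor;
FETCH NEXT FROM stmt_cursor INTO @sql;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC sp_executesql @sql;  -- run each generated DELETE in turn
    FETCH NEXT FROM stmt_cursor INTO @sql;
END
CLOSE stmt_cursor;
DEALLOCATE stmt_cursor;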

How to mark selected columns of a table for display

I use PostgreSQL 10.1 and:
CREATE TABLE human
(
    id ... NOT NULL,
    gender ...,
    height ...,
    weight ...,
    eye ...,
    hair ...,
    ...
);
I have an input form through which I insert the data. I would like an elegant and proper way to specify which columns should be DISPLAYED in that form, something like weight ... DISPLAYED, or eye ... NOT DISPLAYED.
One way is to make NULL correspond to DISPLAYED (when NOT NULL, display it; when NULL, do not display it) and use information_schema, but this correspondence does not make me so happy.
Another way is to:
CREATE TABLE human_column
(
    id ... NOT NULL,
    characteristic character varying(...),
    is_displayed boolean
);
where characteristic data are the names of the columns of human table.
Is there a better way, such as adding a direct foreign attribute to the columns of a table? (In 51.7. pg_attribute there is a column named attoptions. Could it be used?)
specifying "options" for columns to define if they will be "displayed" or not seems a little overhead. Imagine you keep such list in human_column. To modify it you would need to update it with new is_displayed values. Then you would need to build column list to be selected in query.
When you create a view, you do the same thing (specify a list of columns to be displayed), and then you can just query the view without needing to dynamically build the query. Also, you can always check the current list of included columns from the catalog or information_schema.
The only "not cosy" feature of a view is that you can't change its columns, so you have to drop and create it again.
Still, dropping/creating a view on demand looks cheaper to me than dynamically building a query with a list of columns on each select.
demo:
db=# create view v as select oid,datname from pg_database;
CREATE VIEW
db=# select * from v;
  oid  |  datname
-------+------------
 13505 | postgres
 16384 | t
     1 | template1
 13504 | template0
 16419 | o
(5 rows)
checking list of columns:
db=# select column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length from information_schema.columns where table_name = 'v';
 column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length
-------------+------------------+----------------+-------------+-----------+--------------------------
 oid         |                1 |                | YES         | oid       |
 datname     |                2 |                | YES         | name      |
(2 rows)
same for original table:
db=# select column_name,ordinal_position,column_default,is_nullable,data_type,character_maximum_length from information_schema.columns where table_name = 'pg_database';
  column_name  | ordinal_position | column_default | is_nullable | data_type | character_maximum_length
---------------+------------------+----------------+-------------+-----------+--------------------------
 datname       |                1 |                | NO          | name      |
 datdba        |                2 |                | NO          | oid       |
 encoding      |                3 |                | NO          | integer   |
 datcollate    |                4 |                | NO          | name      |
 datctype      |                5 |                | NO          | name      |
 datistemplate |                6 |                | NO          | boolean   |
 datallowconn  |                7 |                | NO          | boolean   |
 datconnlimit  |                8 |                | NO          | integer   |
 datlastsysoid |                9 |                | NO          | oid       |
 datfrozenxid  |               10 |                | NO          | xid       |
 datminmxid    |               11 |                | NO          | xid       |
 dattablespace |               12 |                | NO          | oid       |
 datacl        |               13 |                | YES         | ARRAY     |
(13 rows)
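Changing which columns are "displayed" then just means recreating the view with a new column list; for example, adding encoding to the demo view:
db=# drop view v;
DROP VIEW
db=# create view v as select oid, datname, encoding from pg_database;
CREATE VIEW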

SQL Server - Is it possible to define a table column as a table?

I know that this is possible in Oracle and I wonder if SQL Server also supports it (searched for answer without success).
It would greatly simplify my life in the current project if I could define a column of a table to be a table itself, something like:
Table A:
Column_1   Column_2
+----------+----------------------------------------+
| 1        |  Columns_2_1     Column_2_2            |
|          |  +-------------+------------------+    |
|          |  | 'A'         | 12345            |    |
|          |  +-------------+------------------+    |
|          |  | 'B'         | 777777           |    |
|          |  +-------------+------------------+    |
|          |  | 'C'         | 888888           |    |
|          |  +-------------+------------------+    |
+----------+----------------------------------------+
| 2        |  Columns_2_1     Column_2_2            |
|          |  +-------------+------------------+    |
|          |  | 'X'         | 555555           |    |
|          |  +-------------+------------------+    |
|          |  | 'Y'         | 666666           |    |
|          |  +-------------+------------------+    |
|          |  | 'Z'         | 000001           |    |
|          |  +-------------+------------------+    |
+----------+----------------------------------------+
Thanks in advance.
There is one option where you can store data as XML
Declare @YourTable table (ID int, XMLData xml)
Insert Into @YourTable values
 (1,'<root><ID>1</ID><Active>1</Active><First_Name>John</First_Name><Last_Name>Smith</Last_Name><EMail>john.smith@email.com</EMail></root>')
,(2,'<root><ID>2</ID><Active>0</Active><First_Name>Jane</First_Name><Last_Name>Doe</Last_Name><EMail>jane.doe@email.com</EMail></root>')
Select ID
      ,Last_Name  = XMLData.value('(root/Last_Name)[1]' ,'nvarchar(50)')
      ,First_Name = XMLData.value('(root/First_Name)[1]','nvarchar(50)')
From @YourTable
Returns
ID Last_Name First_Name
1 Smith John
2 Doe Jane
Actually, for a normalized database we do not require such functionality: if we need to store a table within a column, we can instead create a child table and reference the parent table with a foreign key.
That said, if you still insist on such functionality, you can use SQL Server 2016's JSON support to store any associative list in JSON format.
Like:
DECLARE @json NVARCHAR(4000)
SET @json =
N'{
  "info":{
    "type":1,
    "address":{
      "town":"Bristol",
      "county":"Avon",
      "country":"England"
    },
    "tags":["Sport", "Water polo"]
  },
  "type":"Basic"
}'
SELECT
  JSON_VALUE(@json, '$.type') as type,
  JSON_VALUE(@json, '$.info.address.town') as town,
  JSON_QUERY(@json, '$.info.tags') as tags
SELECT value
FROM OPENJSON(@json, '$.info.tags')
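For reference, the first SELECT returns type = Basic, town = Bristol and tags = ["Sport", "Water polo"], while the OPENJSON query returns one row per tag (Sport, Water polo).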
In older versions, this can be achieved through XML, as shown in the previous answer.
You can also make use of the sql_variant datatype to map your table.
Previously, I was also in search of such features as are available in Oracle. But after reading various articles and blogs from experts, I was convinced that such features make things more complex rather than helping.
Storing the data in the required format is not the only concern; it is only worthwhile when the data is also efficiently available (readable).
Hope this helps you make your decision.

Troubleshooting to implement SQL Server trigger

I have this table called InspectionsReview:
CREATE TABLE InspectionsReview
(
    ID int IDENTITY(1,1) NOT NULL,  -- SQL Server identity column (AUTO_INCREMENT is MySQL syntax)
    InspectionItemId int,
    SiteId int,
    ObjectId int,
    DateReview DATETIME,
    PRIMARY KEY (ID)
);
Here is how the table looks:
+----+------------------+--------+----------+------------+
| ID | InspectionItemId | SiteId | ObjectId | DateReview |
+----+------------------+--------+----------+------------+
| 1  | 3                | 3      | 3045     | 20-05-2016 |
| 2  | 5                | 45     | 3025     | 01-03-2016 |
| 3  | 4                | 63     | 3098     | 05-05-2016 |
| 4  | 5                | 5      | 3041     | 03-04-2016 |
| 5  | 3                | 97     | 3092     | 22-02-2016 |
| 6  | 1                | 22     | 3086     | 24-11-2016 |
| 7  | 9                | 24     | 3085     | 15-12-2016 |
+----+------------------+--------+----------+------------+
I need to write a trigger that checks, before a new row is inserted into the table, whether the table already has a row whose ObjectId and DateReview values equal those of the row being inserted; if so, I need to get the ID of the existing row into a trigger variable called duplicate.
For example, if the new row to be inserted is:
INSERT INTO InspectionsReview (InspectionItemId, SiteId, ObjectId, DateReview)
VALUES (4, 63, 3098, '05-05-2016');
then the duplicate variable in the SQL Server trigger must equal 3, because the row in the InspectionsReview table where ID = 3 has the same ObjectId and DateReview values as the new row being inserted. How can I implement this?
With the extra assumption that you want to log all the duplicates to a different table, my solution would be to create an AFTER trigger that checks for the duplicate and inserts it into your logging table.
Of course, whether this is the solution depends on whether my extra assumption is valid.
Here is my logging table.
CREATE TABLE dbo.InspectionsReviewLog (
      ID int
    , ObjectID int
    , DateReview DATETIME
    , duplicate int
);
Here is the trigger (pretty straightforward with the extra assumption)
CREATE TRIGGER tr_InspectionsReview
ON dbo.InspectionsReview
AFTER INSERT
AS
BEGIN
    DECLARE @tableVar TABLE (
          ID int
        , ObjectID int
        , DateReview DATETIME
        , duplicate int
    );

    -- capture inserted rows that collide with an existing row,
    -- along with the ID of the existing (duplicate) row
    INSERT INTO @tableVar (ID, ObjectID, DateReview, duplicate)
    SELECT DISTINCT inserted.ID, inserted.ObjectID, inserted.DateReview, ir.ID
    FROM inserted
    JOIN dbo.InspectionsReview ir
        ON inserted.ObjectID = ir.ObjectID
       AND inserted.DateReview = ir.DateReview
       AND inserted.ID <> ir.ID;

    INSERT INTO dbo.InspectionsReviewLog (ID, ObjectID, DateReview, duplicate)
    SELECT ID, ObjectID, DateReview, duplicate
    FROM @tableVar;
END;
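A quick way to check the trigger is to insert the question's duplicate row and read the log (a sketch; the unambiguous YYYYMMDD literal stands in for '05-05-2016'):
INSERT INTO dbo.InspectionsReview (InspectionItemId, SiteId, ObjectId, DateReview)
VALUES (4, 63, 3098, '20160505');
-- the log should now hold the new row with duplicate = 3
SELECT * FROM dbo.InspectionsReviewLog;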

MySQL I want to optimize this further

So I started off with this query:
SELECT * FROM TABLE1 WHERE hash IN (SELECT id FROM temptable);
It took forever, so I ran an explain:
mysql> explain SELECT * FROM TABLE1 WHERE hash IN (SELECT id FROM temptable);
+----+--------------------+-----------+------+---------------+------+---------+------+------------+-------------+
| id | select_type        | table     | type | possible_keys | key  | key_len | ref  | rows       | Extra       |
+----+--------------------+-----------+------+---------------+------+---------+------+------------+-------------+
|  1 | PRIMARY            | TABLE1    | ALL  | NULL          | NULL | NULL    | NULL | 2554388553 | Using where |
|  2 | DEPENDENT SUBQUERY | temptable | ALL  | NULL          | NULL | NULL    | NULL |       1506 | Using where |
+----+--------------------+-----------+------+---------------+------+---------+------+------------+-------------+
2 rows in set (0.01 sec)
It wasn't using an index. So, my second pass:
mysql> explain SELECT * FROM TABLE1 JOIN temptable ON TABLE1.hash=temptable.hash;
+----+-------------+-----------+------+---------------+------+---------+-----------------------+------+-------------+
| id | select_type | table     | type | possible_keys | key  | key_len | ref                   | rows | Extra       |
+----+-------------+-----------+------+---------------+------+---------+-----------------------+------+-------------+
|  1 | SIMPLE      | temptable | ALL  | hash          | NULL | NULL    | NULL                  | 1506 |             |
|  1 | SIMPLE      | TABLE1    | ref  | hash          | hash | 5       | testdb.temptable.hash |  527 | Using where |
+----+-------------+-----------+------+---------------+------+---------+-----------------------+------+-------------+
2 rows in set (0.00 sec)
Can I do any other optimization?
You can gain some more speed by using a covering index, at the cost of extra space consumption. A covering index is one which can satisfy all requested columns in a query without performing a further lookup into the clustered index.
First of all, get rid of the SELECT * and explicitly select the fields that you require. Then you can add all the fields in the SELECT clause to the right-hand side of your composite index. For example, if your query looks like this:
SELECT first_name, last_name, age
FROM table1
JOIN temptable ON table1.hash = temptable.hash;
Then you can have a covering index that looks like this:
CREATE INDEX ix_index ON table1 (hash, first_name, last_name, age);
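If the index fully covers the query, a re-run of EXPLAIN should show "Using index" in the Extra column for table1, confirming that the rows are served from the index alone:
EXPLAIN SELECT first_name, last_name, age
FROM table1
JOIN temptable ON table1.hash = temptable.hash;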
